Bug 45375 (mozillametabug) - [meta-bug] Issues with building Mozilla (i.e. Firefox) with LTO
Summary: [meta-bug] Issues with building Mozilla (i.e. Firefox) with LTO
Status: NEW
Alias: mozillametabug
Product: gcc
Classification: Unclassified
Component: lto
Version: 4.6.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: lto, meta-bug
Depends on: 92711 110334 44871 44897 44904 44950 45089 45194 45679 45791 45934 47234 47247 47410 48207 48354 48508 48724 48761 48763 48836 49086 54312 56570 88561 88702 88858 90273 93318 96058 97295
Blocks:
Reported: 2010-08-22 09:37 UTC by Jan Hubicka
Modified: 2023-11-22 00:49 UTC
CC List: 19 users

See Also:
Host:
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-08-22 12:40:00


Attachments
Mozilla changes needed. (1.62 KB, patch)
2010-08-22 12:43 UTC, Jan Hubicka
failing testcase (10.82 KB, application/x-sharedlib)
2011-02-05 22:38 UTC, Jan Hubicka
Mozilla updates needed (1.75 KB, patch)
2011-02-16 17:19 UTC, Jan Hubicka
-lm.res (3.91 KB, text/plain)
2011-04-07 19:38 UTC, Markus Trippelsdorf
elfhack.wpa.000i.cgraph (53.50 KB, application/x-bzip2)
2011-04-07 19:39 UTC, Markus Trippelsdorf
Output of -Wl,-Map good (6.75 KB, text/plain)
2011-04-08 15:42 UTC, Markus Trippelsdorf
Output of -Wl,-Map bad (6.63 KB, text/plain)
2011-04-08 15:51 UTC, Markus Trippelsdorf
Use size_t for tree code book-keeping (647 bytes, patch)
2012-10-08 22:30 UTC, Steven Bosscher
Patch to compress line info (1021 bytes, patch)
2013-01-16 17:25 UTC, Jan Hubicka
alternative patch without the compression. (779 bytes, application/octet-stream)
2013-01-17 14:40 UTC, Jan Hubicka
caching (841 bytes, patch)
2013-01-17 15:13 UTC, Jan Hubicka
mozilla-central patch (2.78 KB, text/plain)
2014-01-17 19:05 UTC, Markus Trippelsdorf
My local PGO/LTO script (214 bytes, text/plain)
2014-01-17 19:06 UTC, Markus Trippelsdorf
.mozconfig_profile_gen (585 bytes, text/plain)
2014-01-17 19:07 UTC, Markus Trippelsdorf
Memory usage graphs for -flto=9, -flto=4, -flto=1 with -O2 (124.06 KB, application/x-bzip)
2014-04-02 16:25 UTC, Martin Liška

Description Jan Hubicka 2010-08-22 09:37:44 UTC
Metabug to track all the issues ;)
Comment 1 Jan Hubicka 2010-08-22 12:39:59 UTC
Quick summary :)
1) -g build is currently broken because of dwarf2out recursion. 
2) sqlite still gets miscompiled at 32bit (PR44897), but works now at 64bit for some reason
3) Workaround attached to PR44846 is needed to avoid ICE due to one decl C++ FE issues
4) 32bit mozilla now builds fine for me when linked with -O2, but -Os (the default) leads to segfault at startup apparently because xpcom components do not reproduce correctly
5) Older versions of gold seem to have issues.
6) Martin's devirtualization seems to behave oddly, creating 7400 clones and then redirecting only about 20 calls.
7) Both Martin and Taras reported an ICE in lto-symtab that I can't reproduce.
8) Mozilla needs some changes, since __attribute__ ((used)) is missing. I will attach diff.
9) One needs 4GB in /tmp, with sane partitioning this goes down to 1GB
10) The 32bit build gets close to address space limits at the WPA stage; we probably should not mmap all the .o files, since only about 1GB goes to garbage-collected memory.
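Item 8 refers to `__attribute__ ((used))`. As a minimal sketch of what that attribute does (all names here are hypothetical and not taken from the attached Mozilla patch): it tells GCC to keep a definition even when no reference to it is visible to the compiler, which matters once whole-program analysis is allowed to discard apparently unreferenced statics.

```c
/* Hypothetical names throughout; this is not the code from the Mozilla patch. */

/* Kept in the output even if the optimizer sees no C-level caller,
   e.g. when the only reference comes from an asm statement. */
__attribute__((used)) static int kept_helper(int x)
{
    return x + 1;
}

/* The attribute also applies to static data. */
__attribute__((used)) static const char build_tag[] = "lto-demo";

/* An ordinary caller, present only so the sketch can be exercised. */
int call_kept_helper(int x)
{
    return kept_helper(x);
}
```

In real code the attribute is only needed on definitions whose sole references are invisible to the compiler; the explicit caller above exists just to make the sketch testable.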
Comment 2 Jan Hubicka 2010-08-22 12:43:02 UTC
Created attachment 21543
Mozilla changes needed.
Comment 3 Jan Hubicka 2010-08-22 12:48:19 UTC
mozconfig I use:
export CC="gcc -flto -fuse-linker-plugin"
export CXX="g++ -fwhopr=24 -fuse-linker-plugin -fpermissive"

#export CXX="/builds/slave/tryserver-linux/build/gcc/bin/g++ -fwhopr=16 -fuse-linker-plugin -static-libstdc++ -fpermissive"

ac_add_options --enable-application=browser
ac_add_options --enable-libxul
#ac_add_options --enable-debug
ac_add_options --enable-optimize
ac_add_options --disable-tests
#ac_add_options --enable-debug-symbols
export LDFLAGS="-Wl,--no-keep-memory"
mk_add_options MOZ_MAKE_FLAGS=-j24
mk_add_options MOZ_OBJDIR=/build-mozilla-scratch-O1
Comment 4 Jan Hubicka 2010-08-22 13:10:25 UTC
WPA stage profile afterwards (with sane partitioning).  Decl reading and merging is the major issue.  I am surprised that we are faster at streaming out than at reading.
Execution times (seconds)
 garbage collection    :   5.71 ( 3%) usr   0.00 ( 0%) sys   5.72 ( 3%) wall       0 kB ( 0%) ggc
 callgraph optimization:   1.70 ( 1%) usr   0.00 ( 0%) sys   1.72 ( 1%) wall   13488 kB ( 0%) ggc
 varpool construction  :   0.58 ( 0%) usr   0.01 ( 0%) sys   0.57 ( 0%) wall   43924 kB ( 1%) ggc
 ipa cp                :   1.62 ( 1%) usr   0.02 ( 0%) sys   1.66 ( 1%) wall   70914 kB ( 2%) ggc
 ipa lto gimple in     :   4.28 ( 2%) usr   0.33 ( 4%) sys   4.63 ( 2%) wall      15 kB ( 0%) ggc
 ipa lto gimple out    :   6.45 ( 3%) usr   0.33 ( 4%) sys   6.74 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  48.34 (26%) usr   1.93 (23%) sys  50.30 (26%) wall 3021266 kB (87%) ggc
 ipa lto decl out      :  40.53 (22%) usr   0.19 ( 2%) sys  40.75 (21%) wall       0 kB ( 0%) ggc
 ipa lto decl init I/O :   1.03 ( 1%) usr   0.06 ( 1%) sys   1.08 ( 1%) wall   77094 kB ( 2%) ggc
 ipa lto cgraph I/O    :   0.94 ( 1%) usr   0.21 ( 3%) sys   1.15 ( 1%) wall  237872 kB ( 7%) ggc
 ipa lto decl merge    :  45.14 (24%) usr   1.08 (13%) sys  46.23 (24%) wall     273 kB ( 0%) ggc
 ipa lto cgraph merge  :   0.89 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall    5164 kB ( 0%) ggc
 whopr wpa             :   2.38 ( 1%) usr   0.04 ( 0%) sys   2.41 ( 1%) wall       1 kB ( 0%) ggc
 whopr wpa I/O         :   3.08 ( 2%) usr   3.97 (48%) sys   7.38 ( 4%) wall       0 kB ( 0%) ggc
 ipa reference         :   1.55 ( 1%) usr   0.00 ( 0%) sys   1.59 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile           :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   1.05 ( 1%) usr   0.00 ( 0%) sys   1.04 ( 1%) wall       0 kB ( 0%) ggc
 parser                :   0.58 ( 0%) usr   0.00 ( 0%) sys   0.58 ( 0%) wall   17738 kB ( 1%) ggc
 inline heuristics     :  15.73 ( 8%) usr   0.00 ( 0%) sys  15.74 ( 8%) wall    2974 kB ( 0%) ggc
 callgraph verifier    :   2.56 ( 1%) usr   0.02 ( 0%) sys   2.59 ( 1%) wall       0 kB ( 0%) ggc
 varconst              :   0.01 ( 0%) usr   0.02 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 186.41             8.27           195.10            3491946 kB
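To put a number on "Decl reading and merging is major issue", the three decl-related rows of the table can be summed against the total. This is only a small arithmetic sketch using the usr column copied from the profile above; the function name is made up for illustration.

```c
/* usr seconds copied from the WPA profile: "ipa lto decl in",
   "ipa lto decl out", "ipa lto decl merge", and TOTAL. */
static const double decl_in_usr    = 48.34;
static const double decl_out_usr   = 40.53;
static const double decl_merge_usr = 45.14;
static const double total_usr      = 186.41;

/* Share of user time spent in declaration streaming and merging, in percent. */
double decl_share_percent(void)
{
    return 100.0 * (decl_in_usr + decl_out_usr + decl_merge_usr) / total_usr;
}
```

The three rows add up to 134.01 s, or roughly 72% of the user time, which matches the sum of the per-row percentages (26% + 22% + 24%).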
Comment 5 Jan Hubicka 2010-09-04 20:39:18 UTC
Oprofile of the WHOPR build.  It is quite surprising how low the usual CPU hogs show up.
113909    7.6329  lto1                     lto1                     htab_find_slot_with_hash
42787     2.8671  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
36514     2.4468  lto1                     lto1                     iterative_hash_hashval_t
36289     2.4317  libelf.so.0.8.12         libelf.so.0.8.12         /usr/lib64/libelf.so.0.8.12
28366     1.9008  lto1                     lto1                     htab_expand
27648     1.8527  libc-2.11.1.so           libc-2.11.1.so           memset
27045     1.8123  lto1                     lto1                     cgraph_edge_badness
26670     1.7871  lto1                     lto1                     inflate_fast
25955     1.7392  lto1                     lto1                     lto_input_tree
20010     1.3408  lto1                     lto1                     lto_input_uleb128
18853     1.2633  lto1                     lto1                     bitmap_set_bit
16452     1.1024  as                       as                       /usr/bin/as
16215     1.0865  lto1                     lto1                     lto_input_1_unsigned
16141     1.0816  lto1                     lto1                     lto_output_1_stream
15244     1.0215  libc-2.11.1.so           libc-2.11.1.so           memcpy
15241     1.0213  lto1                     lto1                     htab_hash_string
13806     0.9251  lto1                     lto1                     record_reg_classes.constprop.10
13743     0.9209  lto1                     lto1                     lto_output_tree
13220     0.8859  lto1                     lto1                     ggc_internal_alloc_stat
12879     0.8630  libc-2.11.1.so           libc-2.11.1.so           malloc_consolidate
12847     0.8609  libc-2.11.1.so           libc-2.11.1.so           _int_free
11712     0.7848  lto1                     lto1                     lto_streamer_cache_insert_1
11593     0.7768  lto1                     lto1                     linemap_lookup
11100     0.7438  lto1                     lto1                     ht_lookup_with_hash
10837     0.7262  lto1                     lto1                     gtc_visit
10460     0.7009  lto1                     lto1                     cgraph_estimate_growth
10438     0.6994  lto1                     lto1                     value_member
9812      0.6575  lto1                     lto1                     walk_tree_1
9316      0.6243  oprofiled                oprofiled                /usr/bin/oprofiled
8979      0.6017  libc-2.11.1.so           libc-2.11.1.so           malloc
8825      0.5914  libc-2.11.1.so           libc-2.11.1.so           free
8625      0.5780  lto1                     lto1                     pointer_set_insert
8304      0.5564  lto1                     lto1                     ggc_set_mark
8276      0.5546  lto1                     lto1                     type_pair_eq
8089      0.5420  lto1                     lto1                     gimple_types_compatible_p_1
7981      0.5348  lto1                     lto1                     lto_output_uleb128_stream
7388      0.4951  lto1                     lto1                     df_note_compute
7349      0.4924  lto1                     lto1                     operand_equal_p
7349      0.4924  lto1                     lto1                     pointer_map_contains
7117      0.4769  lto1                     lto1                     bitmap_bit_p
7067      0.4736  lto1                     lto1                     pool_alloc
7030      0.4711  lto1                     lto1                     verify_cgraph_node
6954      0.4660  lto1                     lto1                     lto_input_sleb128
6947      0.4655  lto1                     lto1                     gt_ggc_mx_lang_tree_node
6747      0.4521  libc-2.11.1.so           libc-2.11.1.so           calloc
6403      0.4291  lto1                     lto1                     htab_delete
6360      0.4262  lto1                     lto1                     constrain_operands.part.12
6198      0.4153  lto1                     lto1                     bitmap_clear_bit
6103      0.4090  lto1                     lto1                     cse_insn
Comment 6 Jan Hubicka 2010-09-16 12:09:01 UTC
PR 45679 also reproduces during an -O3 build.  I am testing a patch for it now.
Comment 7 Jan Hubicka 2010-09-17 00:28:22 UTC
Gold shipped with SLES:
GNU gold (GNU Binutils; SUSE Linux Enterprise 11 2.20.0.20100122-0.7.9) 1.
is known to have problems leading to PR45194

The following version: GNU gold (GNU Binutils 2.20.51.20100706) 1.9
works for me.
Comment 8 Jan Hubicka 2010-10-15 01:29:38 UTC
Updated summary...

 - Last patch needed to get Mozilla working is posted as http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01286.html
 - Configuration needs to be done with -fwhopr for C++ and -flto for C, to get around sqlite problem (PR44897)
 - Debugging still needs to be disabled
 - Recent Gold is needed

 - Peak memory use is about 4GB, still more than we should need.  The cause is the WPA stage holding too many declarations.
 - We probably could do better on devirtualization in constructors for addref.

With -O3 --param inline-unit-growth -fwhopr=jobserv the code size seems comparable with a non-LTO -Os build, and the speed with a non-LTO -O3 build.  This seems like quite good news.

The lack of a debug info build seems to be the only remaining showstopper for practical use.
Comment 9 Jan Hubicka 2010-10-18 20:48:03 UTC
Updated summary, Mozilla now builds with unpatched mainline (with checking disabled)
Comment 10 Jan Hubicka 2010-12-01 23:58:30 UTC
I am just trying to get Mozilla building with GNU ld instead of gold.  The first problem is that Mozilla links some of its libraries as:

/abuild/jh/trunk-install/bin/gcc  -O3 -flto -flto-partition=none -fuse-linker-plugin -shared -Wl,-soname -Wl,libplds4.so  -o libplds4.so ./plarena.o ./plhash.o ./plvrsion.o    -L/abuild/jh/build-mozilla-new7/dist/lib -lnspr4

i.e. -fPIC is missing, which means that we compile into non-PIC code and GNU LD eventually complains about PC32 relocations against symbols that can be overridden.

Is this valid? If so, we need to work out -fPIC ourselves at LTO time....

Honza
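As a hedged illustration of the kind of symbol the linker is complaining about (names are hypothetical, and this is not code from Mozilla or NSPR): in a shared library, a default-visibility definition may be overridden at load time, so non-PIC code that refers to it directly cannot be linked with -shared, whereas a hidden-visibility definition always binds locally.

```c
/* Hypothetical names; a sketch of the two kinds of definitions. */

/* Default visibility: exported from a shared library and potentially
   overridden by another definition at load time. Non-PIC references to
   such a symbol are what GNU ld rejects when linking with -shared. */
int exported_value(void)
{
    return 1;
}

/* Hidden visibility: never exported, so references always bind to this
   definition inside the library. */
__attribute__((visibility("hidden"))) int internal_value(void)
{
    return 2;
}

int sum_values(void)
{
    return exported_value() + internal_value();
}
```

Compiling every object that goes into the shared library with -fPIC remains the straightforward fix; the visibility attribute only shows why some symbols are affected and others are not.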
Comment 11 Jan Hubicka 2010-12-02 00:36:44 UTC
OK,
working around the previous issues we fail with:

/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: gTLSIsMainThread: TLS reference in /tmp/cczRYvg1.ltrans0.ltrans.o mismatches non-TLS definition in nsThreadManager.o.ironly section .text

Dave, is this a GNU LD bug?  It seems most likely to me that the nsThreadManager.o.ironly section is the one obtained from the lto plugin, and that we don't put TLS annotations there because we have no way to do so?

Honza
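For context, a thread-local variable in GNU C looks like the sketch below (the names are hypothetical stand-ins for something like gTLSIsMainThread). The linker expects every reference and the definition to agree that the symbol is TLS; the error above arises because one side of the link does not carry that information.

```c
/* Hypothetical names. Every declaration of this variable in other
   translation units must also say `extern __thread int ...`, otherwise
   the objects disagree on whether the symbol is thread-local. */
__thread int example_is_main_thread;

void mark_main_thread(void)
{
    example_is_main_thread = 1;   /* affects the calling thread only */
}

int is_main_thread(void)
{
    return example_is_main_thread;
}
```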
Comment 12 Dave Korn 2010-12-02 01:03:43 UTC
(In reply to comment #11)
> OK,
> working around the previous issues we fail with:
> 
> /abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld:
> gTLSIsMainThread: TLS reference in /tmp/cczRYvg1.ltrans0.ltrans.o mismatches
> non-TLS definition in nsThreadManager.o.ironly section .text
> 
> Dave, is this a GNU LD bug?  It seems to me that most likely that
> nsThreadManager.o.ironly section is the one got from lto plugin and we don't
> put TLS annotations there because we have no way to do so?

  Yeh, precisely.  The ironly file is a placeholder into which we put the symbols found in the lto symtab so that they can take part in the link and their resolutions be determined.  We have no way of conveying any symbol type info.  We'll need to handle this in the multiple-def linker hook in LD's plugin code, by getting it to copy type info from the newly-added symbols to the ironly ones.

Oh, hang on, that won't work.  elflink.c calls _bfd_elf_merge_symbol /before/ _bfd_generic_link_add_one_symbol, which is where the multiple-def hook gets called back from.  So it'll error on the mismatch before we get a chance to do anything about it.  That's awkward.  Need to scratch my head over that for a bit.
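The ordering problem described above can be pictured with a toy model. This is emphatically not BFD code: the types and functions are invented for illustration, and only the order of the two steps mirrors the description (a type check that runs before the hook that would replace the placeholder).

```c
#include <stdbool.h>

/* Toy model only: hypothetical types, not the real BFD data structures. */
enum sym_type { SYM_NOTYPE, SYM_TLS };

struct toy_symbol {
    const char   *name;
    enum sym_type type;
    bool          is_placeholder;  /* came from the plugin's placeholder file */
};

/* Stand-in for the check performed while merging symbols. */
static bool types_agree(const struct toy_symbol *old,
                        const struct toy_symbol *incoming)
{
    return old->type == incoming->type;
}

/* Stand-in for the multiple-definition hook: a real definition
   replaces a placeholder wholesale. */
static bool replace_placeholder(struct toy_symbol *old,
                                const struct toy_symbol *incoming)
{
    if (!old->is_placeholder)
        return false;
    *old = *incoming;
    return true;
}

/* Returns 0 on success, -1 when a type mismatch is reported.
   hook_first selects which of the two steps runs first. */
int toy_add_symbol(struct toy_symbol *existing,
                   const struct toy_symbol *incoming, bool hook_first)
{
    if (hook_first && replace_placeholder(existing, incoming))
        return 0;
    if (!types_agree(existing, incoming))
        return -1;   /* reported before the hook gets a chance to run */
    replace_placeholder(existing, incoming);
    return 0;
}
```

With the check first, a typeless placeholder meeting a TLS definition is reported as a mismatch; with the hook first, the placeholder is simply replaced.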
Comment 13 Jan Hubicka 2010-12-02 08:47:28 UTC
>   Yeh, precisely.  The ironly file is a placeholder into which we put the
> symbols found in the lto symtab so that they can take part in the link and
> their resolutions be determined.  We have no way of conveying any symbol type

The error comes out after the lto1 invocation, so why is the ironly section still around?
I would expect it to be discarded at that time and replaced by whatever the compiler
returns to you.

On the other hand, discarding won't help if there was a non-LTO module referencing
a TLS var also used by an LTO module, I guess.
Comment 14 Dave Korn 2010-12-02 08:52:20 UTC
(In reply to comment #13)
> >   Yeh, precisely.  The ironly file is a placeholder into which we put the
> > symbols found in the lto symtab so that they can take part in the link and
> > their resolutions be determined.  We have no way of conveying any symbol type
> 
> The error comes out after the lto1 invocation, so why the ironly section is
> still around?
> I would expect it to be discarded at that time and replaced by whatever
> compiler
> returns to you.

  It's the symbol from the ironly section that remains, and it gets discarded and replaced by the symbol from the real object file by the linker multiple_definition callback hook when _bfd_generic_link_add_one_symbol is called to add the symbol from the real object file into the link hash table.

  Unfortunately, the elf linker has some additional checking that it does before calling that routine which preemptively complains about the multiple definition before the linker hook has a chance to replace the original ironly symbol by the new one.
Comment 15 Richard Biener 2010-12-02 09:41:58 UTC
(In reply to comment #10)
> I am just trying to get Mozilla building with GNU ld instead of gold.  First
> problem is that Mozilla links some of libraries as:
> 
> /abuild/jh/trunk-install/bin/gcc  -O3 -flto -flto-partition=none
> -fuse-linker-plugin -shared -Wl,-soname -Wl,libplds4.so  -o libplds4.so
> ./plarena.o ./plhash.o ./plvrsion.o    -L/abuild/jh/build-mozilla-new7/dist/lib
> -lnspr4
> 
> i.e. there is missing -fPIC that means that we compile into non-PIC code and
> GNU LD eventually complains about PC32 relocations into symbols that can be
> overwritten.
> 
> Is this valid? If so, we need to work out -fPIC ourselves at LTO time....

It's valid, I think, and we try to work out -fPIC ourselves in the funny
LTO option handling code (but the options are not re-applied at the ltrans
stage, I think, so it doesn't work at all with WHOPR).

Richard.

> Honza
Comment 16 Jan Hubicka 2010-12-02 15:34:48 UTC
> It's valid I think and we try to work out fPIC ourselves in the funny
> LTO option handling code (but the options are not re-applied at ltrans
> stage I think, so it doesn't work at all with WHOPR).

Hmm, the link command above is LTO, not WHOPR.  I wonder why we don't work out -fPIC
ourselves then...

Honza
Comment 17 Jan Hubicka 2010-12-12 23:52:52 UTC
Current mainline crashes:
Program received signal SIGSEGV, Segmentation fault.
lto_cgraph_replace_node (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:227                                                                      
227       if (prevailing_node->same_body_alias)
(gdb) bt                                                                                                                                                                            
#0  lto_cgraph_replace_node (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:227
#1  lto_symtab_merge_cgraph_nodes_1 (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:798
#2  0x0000000000b0ae08 in htab_traverse_noresize (htab=<value optimized out>, callback=0x60eca0 <lto_symtab_merge_cgraph_nodes_1>, info=0x0) at ../../libiberty/hashtab.c:784
#3  0x00000000004aabf9 in read_cgraph_and_symbols () at ../../gcc/lto/lto.c:2213
#4  lto_main () at ../../gcc/lto/lto.c:2438
#5  0x00000000006cb658 in compile_file (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:579
#6  do_compile (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:1874
#7  toplev_main (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:1937
#8  0x00007ffff6597bc6 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000493411 in _start () at ../sysdeps/x86_64/elf/start.S:113

I guess it is fallout from the merging patch. It is weird, since prevailing_node is NULL.
_moz_cairo_surface_destroy/567259(-1) @0x7ffebef47c60 (asm: _moz_cairo_surface_destroy) visibilit: 2 binds_local
  called by: CreateSimilarSurface/567227 (0.21 per call) CreateSimilarSurface/567227 (0.14 per call) Init/567225 (0.39 per call) _ZN11gfxASurface7ReleaseEv.part.2/567209 (1.00 per call) 
  calls: 
  References: 
  Refering this function: 
$5 = void

I also generated a profile.
samples  %        image name               app name                 symbol name
228038   25.3225  lto1                     lto1                     htab_find_slot_with_hash                                                                                        
82588     9.1710  lto1                     lto1                     iterative_hash_hashval_t
58000     6.4406  lto1                     lto1                     type_pair_eq                                                                                                    
32557     3.6153  lto1                     lto1                     gimple_lookup_type_leader
31622     3.5115  lto1                     lto1                     gtc_visit                                                                                                       
29149     3.2369  lto1                     lto1                     htab_expand
27463     3.0496  lto1                     lto1                     gimple_type_hash_1                                                                                              
24348     2.7037  lto1                     lto1                     gimple_types_compatible_p
24217     2.6892  lto1                     lto1                     inflate_fast                                                                                                    
21984     2.4412  lto1                     lto1                     gimple_types_compatible_p_1
21796     2.4203  libc-2.11.1.so           libc-2.11.1.so           memset                                                                                                          
21700     2.4097  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
17894     1.9870  lto1                     lto1                     lookup_type_pair.isra.120.constprop.129                                                                         
16087     1.7864  lto1                     lto1                     ggc_set_mark
15719     1.7455  lto1                     lto1                     gt_ggc_mx_lang_tree_node        

Our abuse of hashing is making us slow.  It is not only type merging but all the hashing during streaming in.
Comment 18 Jan Hubicka 2010-12-14 15:36:47 UTC
Filed the segfault as PR46940.
It is really a sickness of the Mozilla sources, which define an _INT symbol, a _moz symbol and a function of the same name and visibility, and use both. In any case we should handle this gracefully too.

Honza
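One common way for a single function to end up with several symbol names is a hidden alias. Whether this is exactly what the Mozilla sources do is an assumption; the sketch below (hypothetical names, GCC on an ELF target assumed) only shows the general pattern of two names resolving to one definition, which the symbol table merging then has to cope with.

```c
/* Hypothetical names; assumes GCC and an ELF target. */

/* The public entry point. */
int demo_surface_refs(void)
{
    return 1;
}

/* A second, hidden name for the same definition, meant for calls from
   inside the library. */
extern __typeof__(demo_surface_refs) demo_surface_refs_internal
    __attribute__((alias("demo_surface_refs"), visibility("hidden")));

/* Code that uses both names ends up referring to one function. */
int use_both(void)
{
    return demo_surface_refs() + demo_surface_refs_internal();
}
```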
Comment 19 Jan Hubicka 2010-12-15 00:44:25 UTC
Filed the GNU LD bug as http://sourceware.org/bugzilla/show_bug.cgi?id=12323
Comment 20 H.J. Lu 2010-12-17 22:25:56 UTC
(In reply to comment #19)
> Filled in the GNU LD bug as
> http://sourceware.org/bugzilla/show_bug.cgi?id=12323

It should have been fixed on hjl/lto-mixed branch at

http://git.kernel.org/?p=devel/binutils/hjl/x86.git;a=summary
Comment 21 Jan Hubicka 2011-01-05 13:36:37 UTC
I am re-building now.  Martin's edge cgraph_opt streaming fix is needed and flag_shlib needs to be set in lto-opts.c.
With this fixed, oprofile shows that cc1plus spends a lot of time in lookup_field_1.

40259     5.6000  cc1plus                  cc1plus                  lookup_field_1
20275     2.8203  cc1plus                  cc1plus                  longest_match
15843     2.2038  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
12409     1.7261  libc-2.11.1.so           libc-2.11.1.so           memset
10680     1.4856  cc1plus                  cc1plus                  htab_find_slot_with_hash
10471     1.4565  libc-2.11.1.so           libc-2.11.1.so           vfprintf
9004      1.2525  cc1plus                  cc1plus                  deflate_slow
8580      1.1935  cc1plus                  cc1plus                  ggc_internal_alloc_stat
8300      1.1545  libc-2.11.1.so           libc-2.11.1.so           memcpy
8100      1.1267  cc1plus                  cc1plus                  ht_lookup_with_hash
8044      1.1189  libpython2.6.so.1.0      libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
7840      1.0905  cc1plus                  cc1plus                  _cpp_lex_direct
6340      0.8819  cc1plus                  cc1plus                  pointer_set_insert

I am adding the C++ maintainers to CC, as this seems like relatively low-hanging fruit for a noticeable compilation speedup? It tends to show up in oprofile as 5-7% of compile time.
Comment 22 Mark Mitchell 2011-01-06 03:55:40 UTC
On 1/5/2011 5:36 AM, hubicka at gcc dot gnu.org wrote:

> 40259     5.6000  cc1plus                  cc1plus                 
> lookup_field_1

I've looked at this, in the distant past.  I don't think the routine
itself is *very* low-hanging fruit; it's already using an inline log n
algorithm to find a field in most cases, and I bet that's as good as a
hash table since n is generally relatively small.  But, maybe "in most
cases" is wrong; there is a slow-path, and we should confirm that most
of the time is in the fast-path code.

We could also try a bit of memoization; I wouldn't be surprised if we
often lookup "x.y" several times in a row.

More often, when I've looked at this kind of thing, though, I've
concluded that the problem was that we were calling the routine too
often, rather than the routine itself was too slow.  Quite possibly we
could improve algorithms that are using lookup_field_1 so that they
didn't do so as often, by building caches or otherwise.  For that, we'd
need to look at the callers of lookup_field_1.

So, in summary, I'd recommend three things:

* Split lookup_field_1 into its fast-path and slow-path code so that we
can profile it and figure out which code is taking up most of the time.

* Assuming it's fast-path code, look at the frequent callers and think
about how to optimize them.
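The two ideas mentioned above, a log n lookup over sorted fields and a bit of memoization for repeated lookups of the same name, can be sketched as follows. This is not the cc1plus implementation of lookup_field_1: the structures and function names are hypothetical, and plain strings stand in for the compiler's identifier nodes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a class member; not a GCC data structure. */
struct field {
    const char *name;
    int         offset;
};

/* Binary search over an array of fields sorted by name: O(log n). */
static const struct field *lookup_sorted(const struct field *fields,
                                         size_t n, const char *name)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, fields[mid].name);
        if (cmp == 0)
            return &fields[mid];
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return NULL;
}

/* One-entry cache remembering the last successful lookup. */
struct lookup_cache {
    const struct field *fields;   /* table the cached result belongs to */
    const struct field *result;   /* last field found, or NULL */
    unsigned long       hits;
};

const struct field *lookup_memo(struct lookup_cache *cache,
                                const struct field *fields,
                                size_t n, const char *name)
{
    if (cache->result && cache->fields == fields
        && strcmp(cache->result->name, name) == 0) {
        cache->hits++;
        return cache->result;     /* same name as last time: skip the search */
    }
    const struct field *found = lookup_sorted(fields, n, name);
    if (found) {
        cache->fields = fields;
        cache->result = found;
    }
    return found;
}
```

A one-entry cache only pays off if the same name really is looked up several times in a row, which is exactly what profiling the callers would have to confirm.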
Comment 23 Jan Hubicka 2011-01-07 18:11:39 UTC
I've updated mozilla tree and rebuilt with top of tree GCC.  The resulting binary seems to work well. Two GCC patches are required:

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00210.html
solving the -fPIC issues (with gold this is silently ignored, but we end up with non-PIC shared libraries, which is bad for startup time)

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00375.html
to solve problem with undefined aliases while building libxul.

The Mozilla patchset seems to be the same as posted earlier. I will try to move to a debug build and also try profile feedback.

Memory peaks at 6.5GB, so we will not be able to build in a 32bit environment unless we solve the issues with storing too many types.

Honza
Comment 24 Jan Hubicka 2011-01-07 18:21:03 UTC
Author: hubicka
Date: Fri Jan  7 18:21:00 2011
New Revision: 168580

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168580
Log:
	PR lto/45375
	* lto-opt.c (lto_reissue_options): Set flag_shlib.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lto-opts.c
Comment 25 Jan Hubicka 2011-01-08 21:06:27 UTC
With current mainline and a release-checking compiler, I can for the first time build Mozilla with debug info.  7.5GB of RAM is needed.
Comment 26 Alexey Feldgendler 2011-01-08 21:10:50 UTC
This is a great success, although I have to say it's still way too much RAM to ask for. In particular, this excludes the possibility of compiling on a 32-bit architecture.
Comment 27 Jan Hubicka 2011-01-08 21:35:00 UTC
There is a lot of room for improvement in the WPA memory use, but I am not sure how much we can still fit in 4.6.0...
Comment 28 Jan Hubicka 2011-01-10 01:29:19 UTC
With fixes for PR47234 and PR47233 I can build -fprofile-generate libxul. I haven't yet tried whether the profile applies, since the build later dies at:
/abuild/jh/trunk-install/bin/g++ -fpermissive -O3 -flto=24 -fuse-linker-plugin -fprofile-generate  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -fno-strict-aliasing -fshort-wchar -pthread -pipe  -DNDEBUG -DTRIMMED -g   -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include   -I/usr/include/gtk-2.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12   -I/usr/include/gtk-2.0 -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -I/usr/include/gtk-unix-print-2.0    -fPIC -shared -Wl,-z,defs -Wl,-h,libmozgnome.so -o libmozgnome.so  nsGnomeModule.o nsAlertsService.o nsAlertsIconListener.o     -lpthread    -Wl,-rpath-link,/abuild/jh/build-mozilla-new8-prof/dist/bin -Wl,-rpath-link,/usr/local/lib /abuild/jh/build-mozilla-new8-prof/dist/lib/libxpcomglue_s.a -L/abuild/jh/build-mozilla-new8-prof/dist/bin -lxpcom -lmozalloc -L/abuild/jh/build-mozilla-new8-prof/dist/bin -lxpcom -lmozalloc -L/abuild/jh/build-mozilla-new8-prof/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl   -lgobject-2.0 -lglib-2.0   -L/lib64 -lnotify -lgtk-x11-2.0 -ldbus-glib-1 -lgdk-x11-2.0 -latk-1.0 -lgio-2.0 -lpangoft2-1.0 -lgdk_pixbuf-2.0 -lpangocairo-1.0 -lcairo -lpango-1.0 -lfreetype -lz -lfontconfig -lgmodule-2.0 -ldbus-1 -lgobject-2.0 -lglib-2.0     -Wl,--version-script -Wl,/abuild/jh/mozilla-central2/mozilla-central/build/unix/gnu-ld-scripts/components-version-script -Wl,-Bsymbolic -ldl    
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x10): error: undefined reference to 'gfxASurface::BeginPrinting(nsAString const&, nsAString const&)'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x18): error: undefined reference to 'gfxASurface::EndPrinting()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x20): error: undefined reference to 'gfxASurface::AbortPrinting()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x28): error: undefined reference to 'gfxASurface::BeginPage()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x30): error: undefined reference to 'gfxASurface::EndPage()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x38): error: undefined reference to 'gfxASurface::Finish()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x40): error: undefined reference to 'gfxASurface::CreateSimilarSurface(gfxASurface::gfxContentType, gfxIntSize const&)'

Those seem suspicious.  I saw a similar problem previously: the vtables are there but they are not finalized.  The non-LTO objects don't seem to refer to them, so perhaps we do too much folding...
I am a bit lost.
Comment 29 Jan Hubicka 2011-01-10 01:59:31 UTC
... and hacking around, the profile doesn't read back even with -fprofile-correction
/abuild/jh/trunk-install/bin/gcc  -O3 -flto -flto-partition=none -fuse-linker-plugin -fprofile-correction -fprofile-use -o jemalloc.o -c  -DOSTYPE=\"Linux2.6.32.12-0\" -DOSARCH=Linux  -I/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc -I. -I../../dist/include -I../../dist/include/nsprpub  -I/abuild/jh/build-mozilla-new8-prof/dist/include/nspr -I/abuild/jh/build-mozilla-new8-prof/dist/include/nss       -fPIC  -Wall -W -Wno-unused -Wpointer-arith -Wcast-align -W -pedantic -Wno-long-long -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -g   -include ../../mozilla-config.h -DMOZILLA_CLIENT -MD -MF .deps/jemalloc.pp /abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'arena_malloc':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: note: correcting inconsistent profile data
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'malloc_mutex_unlock':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: error: corrupted profile info: edge from 0 to 2 exceeds maximal count
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'malloc_mutex_lock':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: error: corrupted profile info: edge from 2 to 3 exceeds maximal count

Will see if this reproduces without LTO.
Comment 30 Jan Hubicka 2011-01-10 16:39:08 UTC
The libmozgnome build issue is now Mozilla PR https://bugzilla.mozilla.org/show_bug.cgi?id=624385
Comment 31 Jan Hubicka 2011-01-10 22:51:06 UTC
Mozilla now builds with profile feedback and LTO.
One needs to train without LTO (i.e. -fprofile-generate -O3 only) and then build with LTO (-fprofile-use -O3 -flto) because of the aforementioned problems with undefined symbols.

The resulting binary works, except for libmozsqlite, which gets misoptimized (PR44897).
http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00375.html is still needed on the GCC side.
Comment 32 Jan Hubicka 2011-01-10 23:37:14 UTC
Author: hubicka
Date: Mon Jan 10 23:37:11 2011
New Revision: 168643

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168643
Log:
	PR lto/45375
	* profile.c (read_profile_edge_counts): Ignore profile inconistency
	when correcting profile.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/profile.c
Comment 33 Jan Hubicka 2011-01-10 23:37:48 UTC
Author: hubicka
Date: Mon Jan 10 23:37:45 2011
New Revision: 168644

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168644
Log:
	PR lto/45375
	* lto-cgraph.c (input_profile_summary): Remove overactive sanity check.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lto-cgraph.c
Comment 34 Jan Hubicka 2011-01-11 17:29:57 UTC
Author: hubicka
Date: Tue Jan 11 17:29:52 2011
New Revision: 168666

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168666
Log:

	PR lto/45721
	PR lto/45375
	* tree.h (symbol_alias_set_t): Move typedef here from varasm.c
	(symbol_alias_set_destroy, symbol_alias_set_contains,
	propagate_aliases_backward): Declare.
	* lto-streamer-out.c (struct sets): New sturcture.
	(trivally_defined_alias): New function.
	(output_alias_pair_p): Rewrite.
	(output_unreferenced_globals): Fix output of alias pairs.
	(produce_symtab): Likewise.
	* ipa.c (function_and_variable_visibility): Set weak alias destination
	as needed in lto.
	* varasm.c (symbol_alias_set_t): Remove.
	(symbol_alias_set_destroy): Export.
	(propagate_aliases_forward, propagate_aliases_backward): New functions
	based on ...
	(compute_visible_aliases): ... this one; remove.
	(trivially_visible_alias): New
	(trivially_defined_alias): New.
	(remove_unreachable_alias_pairs): Rewrite.
	(finish_aliases_1): Reorganize code checking if alias is defined.
	* passes.c (rest_of_decl_compilation): Do not call assemble_alias when
	in LTO mode.

	* lto.c (partition_cgraph_node_p, partition_varpool_node_p): Weakrefs are
	not partitioned.

	* testsuite/gcc.dg/lto/pr45721_1.c: New file.
	* testsuite/gcc.dg/lto/pr45721_0.c: New file.

Added:
    trunk/gcc/testsuite/gcc.dg/lto/pr45721_0.c
    trunk/gcc/testsuite/gcc.dg/lto/pr45721_1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lto-streamer-out.c
    trunk/gcc/lto/ChangeLog
    trunk/gcc/lto/lto.c
    trunk/gcc/passes.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree.h
    trunk/gcc/varasm.c
Comment 35 Jan Hubicka 2011-01-15 16:42:06 UTC
I looked briefly into the effectiveness of the devirtualization bits and they don't seem to work terribly well.
In libxul compiled with GCC 4.3 at -O3 there are 82155 indirect calls.
In mainline -O3 libxul there are 83023, and with LTO there are 87763.

The ipa-prop bits at LTO devirtualize 1 call, which is subsequently optimized away (since -fno-devirtualize seems to give the same result as -fdevirtualize).

I will give http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01214.html a try.

However we _really_ need testcases from Mozilla where devirtualization is valid and we don't do it.
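
To illustrate the kind of testcase being asked for, here is a minimal sketch. It is not taken from Mozilla; the names A, B, call_through_base and devirt_candidate are made up for illustration. The object's dynamic type is visible at the call site, so once call_through_base is inlined the call could in principle be resolved to B::foo without a vtable lookup. Whether a particular GCC version actually does so is not claimed here.

```cpp
// Hypothetical classes, not from the Mozilla tree.
struct A {
    virtual int foo(int x) { return x + 1; }
    virtual ~A() {}
};

struct B : A {
    int foo(int x) override { return x * 2; }
};

// Seen in isolation, this is an indirect call through the vtable.
static int call_through_base(A *a, int x) {
    return a->foo(x);
}

// Here the dynamic type of the object is known to be B, so after inlining
// call_through_base the call is a candidate for devirtualization to B::foo.
int devirt_candidate(int x) {
    B b;
    return call_through_base(&b, x);
}
```

One way to check the result is to compile with -O3 -fdump-tree-optimized and look for a direct call to B::foo (or its inlined body) in the dump.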
Comment 36 Jan Hubicka 2011-01-15 17:21:16 UTC
Hmm, the patch makes no difference, but I also see failures in its testcases
FAIL: g++.dg/ipa/imm-devirt-1.C scan-tree-dump optimized "= B::.*foo"
FAIL: g++.dg/ipa/imm-devirt-2.C scan-tree-dump optimized "= B::.*foo"
so I will wait for Martin to commit the rest of his series and/or update the patch.
Comment 37 Jan Hubicka 2011-01-20 10:21:06 UTC
I tested Martin's devirtualization patch at cgraph build. The net result is a decrease in the number of indirect calls in libxul by 2. The code size decreases by about 3KB, so there is probably more devirtualization happening than just 2 calls, but the subsequent inlining increases the final number of virtual calls again.

So for 4.6.0 we won't see more improvements and we can look into improving devirtualization in 4.7.  But having testcases that can be resolved by other compilers, but not by GCC, is a must.
Comment 38 Jan Hubicka 2011-02-05 22:38:41 UTC
Created attachment 23253 [details]
failing testcase

With current mainline and top of tree mozilla, things seem to go well, sqlite issues are gone.  I now however get an elfhack fault:

jh@evans:/abuild/jh/build-mozilla-new9/build/unix/elfhack> /abuild/jh/build-mozilla-new9/build/unix/elfhack/elfhack -b test.so
test.so: terminate called after throwing an instance of 'std::runtime_error'
  what():  Section index out of bounds
Aborted (core dumped)

I am attaching the test.so I get, to see if it is an elfhack miscompilation or the binary.
Comment 39 Mike Hommey 2011-02-07 18:40:22 UTC
(In reply to comment #38)
> Created attachment 23253 [details]
> failing testcase
> 
> With current mainline and top of tree mozilla, things seem to go well, sqlite
> issues are gone.  I now however get an elfhack fault:
> 
> jh@evans:/abuild/jh/build-mozilla-new9/build/unix/elfhack>
> /abuild/jh/build-mozilla-new9/build/unix/elfhack/elfhack -b test.so
> test.so: terminate called after throwing an instance of 'std::runtime_error'
>   what():  Section index out of bounds
> Aborted (core dumped)
> 
> I am attaching the test.so I get, to see if it is an elfhack miscompilation or the
> binary.

That could well be https://bugzilla.mozilla.org/show_bug.cgi?id=629638
Can you check with a changeset newer than http://hg.mozilla.org/mozilla-central/rev/2772a0cf36d1 ?
Comment 40 Martin Jambor 2011-02-09 14:12:57 UTC
(In reply to comment #39)
> That could well be https://bugzilla.mozilla.org/show_bug.cgi?id=629638
> Can you check with a changeset newer than
> http://hg.mozilla.org/mozilla-central/rev/2772a0cf36d1 ?

I have just checked-out mozilla-central entirely by doing 

hg clone http://hg.mozilla.org/mozilla-central/

and the elfhack test still segfaults for me (with lto).
Comment 41 Mike Hommey 2011-02-09 14:34:08 UTC
(In reply to comment #40)
> I have just checked-out mozilla-central entirely by doing 
> 
> hg clone http://hg.mozilla.org/mozilla-central/
> 
> and the elfhack test still segfaults for me (with lto).

Segfaults or aborts ?
Comment 42 Martin Jambor 2011-02-10 17:35:36 UTC
(In reply to comment #41)
> 
> Segfaults or aborts ?

Segfaults:

===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/elfhack -b test.so
test.so: Reduced by 12128 bytes
# Fail if the backup file doesn't exist
[ -f "test.so.bak" ]
# Fail if the new library doesn't contain less relocations
[ $(objdump -R test.so.bak | wc -l) -gt $(objdump -R test.so | wc -l) ]
/home/mjambor/gcc/icln/inst/bin/gcc -o dummy dummy.o test.so
# Will either crash or return exit code 1 if elfhack is broken
LD_LIBRARY_PATH=/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/dummy
make[6]: *** [libs] Segmentation fault
make[6]: Leaving directory `/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack'

...and very early on it seems:

(gdb) bt
#0  0x00007ffff7ff7040 in frame_dummy ()
   from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
#1  0x00007ffff7ff6f5e in _init () from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
#2  0x00007ffff7ffa710 in ?? ()
#3  0x00007ffff7debe18 in call_init () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7debf47 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7ddeb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
Comment 43 Mike Hommey 2011-02-10 17:41:53 UTC
(In reply to comment #42)
> (In reply to comment #41)
> > 
> > Segfaults or aborts ?
> 
> Segfaults:
> 
> ===
> === If you get failures below, please file a bug describing the error
> === and your environment (compiler and linker versions), and use
> === --disable-elf-hack until this is fixed.
> ===
> /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/elfhack -b
> test.so
> test.so: Reduced by 12128 bytes
> # Fail if the backup file doesn't exist
> [ -f "test.so.bak" ]
> # Fail if the new library doesn't contain less relocations
> [ $(objdump -R test.so.bak | wc -l) -gt $(objdump -R test.so | wc -l) ]
> /home/mjambor/gcc/icln/inst/bin/gcc -o dummy dummy.o test.so
> # Will either crash or return exit code 1 if elfhack is broken
> LD_LIBRARY_PATH=/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack
> /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/dummy
> make[6]: *** [libs] Segmentation fault
> make[6]: Leaving directory
> `/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack'
> 
> ...and very early on it seems:
> 
> (gdb) bt
> #0  0x00007ffff7ff7040 in frame_dummy ()
>    from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
> #1  0x00007ffff7ff6f5e in _init () from
> /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
> #2  0x00007ffff7ffa710 in ?? ()
> #3  0x00007ffff7debe18 in call_init () from /lib64/ld-linux-x86-64.so.2
> #4  0x00007ffff7debf47 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
> #5  0x00007ffff7ddeb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2

Ah, so this is a crash of the test, not of elfhack. Could you attach both test.so and test.so.bak files ?
Comment 44 Mike Hommey 2011-02-10 17:43:04 UTC
(In reply to comment #43)
> Ah, so this is a crash of the test, not of elfhack. Could you attach both
> test.so and test.so.bak files ?

Actually, it would be better to just do that on bugzilla.mozilla.org. (please Cc ":glandium" there)
Comment 45 Mike Hommey 2011-02-12 09:32:34 UTC
Can you try mozilla-central revision 19f13dea4d4a?
Comment 46 Martin Jambor 2011-02-13 12:41:29 UTC
(In reply to comment #45)
> Can you try mozilla-central revision 19f13dea4d4a?

With that revision the elfhack problems are gone.  Thanks!
Comment 47 Martin Jambor 2011-02-16 16:30:31 UTC
With the elfhack issues gone, the build now fails with:

----------------------------------------------------------------------

/home/mjambor/gcc/icln/inst/bin/g++ -o js  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -O2 -flto=jobserver -fpermissive -fuse-linker-plugin -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -Os -freorder-blocks -fomit-frame-pointer   js.o jsworkers.o   -lpthread -O2 -flto=jobserver -fuse-linker-plugin   -Wl,-rpath-link,/bin -Wl,-rpath-link,/home/mjambor/mozilla/lto/objdir-ff-release/dist/lib  -L../../../dist/bin -L../../../dist/lib -L/home/mjambor/mozilla/lto/objdir-ff-release/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl
make[6]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
/home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x33): error: undefined reference to 'SetVMFrameRegs'
/home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x3b): error: undefined reference to 'PushActiveVMFrame'
/home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x4d): error: undefined reference to 'PopActiveVMFrame'
/home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x6b): error: undefined reference to 'js_InternalThrow'
/home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x7a): error: undefined reference to 'PopActiveVMFrame'
collect2: ld returned 1 exit status
make[5]: *** [js] Error 1
make[5]: Leaving directory `/home/mjambor/mozilla/lto/objdir-ff-release/js/src/shell'

----------------------------------------------------------------------

I have not been able to have a closer look at the issue yet but hope
to do so soon.
Comment 48 Jan Hubicka 2011-02-16 17:19:47 UTC
Created attachment 23364 [details]
Mozilla updates needed

Updated mozilla patch fixing the undefined symbols Martin reported.
Sorry, I had it in my tree for a while, but didn't notice the PR was out of date.
Comment 49 Martin Jambor 2011-02-17 13:15:48 UTC
(In reply to comment #48)
> Updated mozilla patch fixing the undefined symbols Martin reported.
> Sorry, I had it in my tree for a while, but didn't notice the PR was out of date.

Thanks,  that resolved these issues.  However, now my 8GB machine runs
out of memory when linking libxul.so.
Comment 50 Jan Hubicka 2011-02-17 15:16:19 UTC
> Thanks,  that resolved these issues.  However, now my 8GB machine runs
> out of memory when linking libxul.so.

That is expected. With Richard's -g fixes, memory usage is slightly over 8GB.
Just add some swap; since it gets over 8GB only for a short time during WPA, it might
not be that bad.

Honza
Comment 51 Martin Jambor 2011-02-18 12:30:08 UTC
I tried again on a machine with more RAM and LTO build succeeded for me as well.  Thanks a lot.
Comment 52 Markus Trippelsdorf 2011-03-09 12:37:00 UTC
Just a warning: Building a -fprofile-generate libxul uses
~13GB of memory. (I have 8GB on my build-system and lto1
got killed several times by the OOM killer, until I added
enough swap space.)
The build process still fails later on as described in Comment 28.
Comment 53 Markus Trippelsdorf 2011-03-09 13:46:39 UTC
Building fails with GNU ld (Linux/GNU Binutils) 2.21.51.0.7.20110306:

c++ -o xpcshell -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -fshort-wchar -pthread -pipe -DNDEBUG -DTRIMMED -O3  xpcshell.o   -lpthread -Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory  -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/bin -Wl,-rpath-link,/usr/lib  -L../../../../dist/bin -L../../../../dist/lib ../../../../dist/lib/libxpcomglue_s.a -L/var/tmp/mozilla-central/moz-build-dir/dist/bin -lxpcom -lmozalloc -lxul  -L/var/tmp/mozilla-central/moz-build-dir/dist/bin -lxpcom -lmozalloc -lxul   -Wl,-R/usr/lib64 -L/usr/lib64 -lplds4 -lplc4 -lnspr4 -lpthread -ldl -ldl
../../../../dist/bin/libxul.so: undefined reference to `PR_smprintf_free'
../../../../dist/bin/libxul.so: undefined reference to `PR_SetEnv'
../../../../dist/bin/libxul.so: undefined reference to `PR_Now'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetErrorText'
../../../../dist/bin/libxul.so: undefined reference to `PR_FindFunctionSymbol'
../../../../dist/bin/libxul.so: undefined reference to `PR_PushIOLayer'
../../../../dist/bin/libxul.so: undefined reference to `PR_ntohs'
../../../../dist/bin/libxul.so: undefined reference to `PR_FormatTimeUSEnglish'
../../../../dist/bin/libxul.so: undefined reference to `PR_MemMap'
../../../../dist/bin/libxul.so: undefined reference to `PR_LocalTimeParameters'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetDefaultIOMethods'
../../../../dist/bin/libxul.so: undefined reference to `PR_ReadDir'
../../../../dist/bin/libxul.so: undefined reference to `PR_SetPollableEvent'
../../../../dist/bin/libxul.so: undefined reference to `PR_FindSymbol'
/usr/lib/libssl3.so: undefined reference to `PR_OpenAnonFileMap'
/usr/lib/libssl3.so: undefined reference to `PR_ExportFileMapAsString'
../../../../dist/bin/libxul.so: undefined reference to `PR_Delete'
../../../../dist/bin/libxul.so: undefined reference to `PR_AtomicSet'
/usr/lib/libnss3.so: undefined reference to `PR_NewRWLock'
../../../../dist/bin/libxul.so: undefined reference to `PR_SetNetAddr'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetNumberOfProcessors'
../../../../dist/bin/libxul.so: undefined reference to `PR_SecondsToInterval'
../../../../dist/bin/libxul.so: undefined reference to `PR_Close'
../../../../dist/bin/libxul.so: undefined reference to `PR_vsprintf_append'
../../../../dist/bin/libxul.so: undefined reference to `PR_Bind'
../../../../dist/bin/libxul.so: undefined reference to `PR_Sleep'
../../../../dist/bin/libxul.so: undefined reference to `PR_OpenTCPSocket'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetRandomNoise'
../../../../dist/bin/libxul.so: undefined reference to `PR_Send'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetPhysicalMemorySize'
../../../../dist/bin/libxul.so: undefined reference to `PR_NotifyAllCondVar'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetUniqueIdentity'
../../../../dist/bin/libxul.so: undefined reference to `PR_ConnectContinue'
../../../../dist/bin/libxul.so: undefined reference to `PR_snprintf'
../../../../dist/bin/libxul.so: undefined reference to `PR_CreateFileMap'
/usr/lib/libnss3.so: undefined reference to `PR_NewTCPSocket'
/usr/lib64/libplc4.so: undefined reference to `PR_Assert'
../../../../dist/bin/libxul.so: undefined reference to `PR_htons'
../../../../dist/bin/libxul.so: undefined reference to `PR_FreeAddrInfo'
/usr/lib/libnss3.so: undefined reference to `PR_Shutdown'
/usr/lib/libssl3.so: undefined reference to `PR_ImportFileMapFromString'
/usr/lib/libnss3.so: undefined reference to `PR_EnumerateHostEnt'
../../../../dist/bin/libxul.so: undefined reference to `PR_Malloc'
/usr/lib/libnss3.so: undefined reference to `PR_SetErrorText'
../../../../dist/bin/libxul.so: undefined reference to `PR_EnumerateAddrInfo'
../../../../dist/bin/libxul.so: undefined reference to `PR_ConvertIPv4AddrToIPv6'
../../../../dist/bin/libxul.so: undefined reference to `PR_WaitProcess'
../../../../dist/bin/libxul.so: undefined reference to `PR_IntervalNow'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetHostByName'
../../../../dist/bin/libxul.so: undefined reference to `LL_MaxUint'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetSocketOption'
../../../../dist/bin/libxul.so: undefined reference to `PR_Free'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetPageShift'
../../../../dist/bin/libxul.so: undefined reference to `PR_LogPrint'
../../../../dist/bin/libxul.so: undefined reference to `PR_JoinThread'
/usr/lib/libnss3.so: undefined reference to `PR_VersionCheck'
../../../../dist/bin/libxul.so: undefined reference to `PR_NewThreadPrivateIndex'
../../../../dist/bin/libxul.so: undefined reference to `PR_IsNetAddrType'
../../../../dist/bin/libxul.so: undefined reference to `PR_vsmprintf'
../../../../dist/bin/libxul.so: undefined reference to `PR_Recv'
../../../../dist/bin/libxul.so: undefined reference to `PR_strtod'
../../../../dist/bin/libxul.so: undefined reference to `PR_Notify'
../../../../dist/bin/libxul.so: undefined reference to `PR_Poll'
../../../../dist/bin/libxul.so: undefined reference to `PR_CeilingLog2'
../../../../dist/bin/libxul.so: undefined reference to `PR_SetSocketOption'
../../../../dist/bin/libxul.so: undefined reference to `PR_OpenUDPSocket'
../../../../dist/bin/libxul.so: undefined reference to `PR_PopIOLayer'
../../../../dist/bin/libxul.so: undefined reference to `PR_LoadLibraryWithFlags'
../../../../dist/bin/libxul.so: undefined reference to `PR_dtoa'
../../../../dist/bin/libxul.so: undefined reference to `PR_AtomicDecrement'
../../../../dist/bin/libxul.so: undefined reference to `PR_GetEnv'
/usr/lib/libssl3.so: undefined reference to `PR_Interrupt'
...

gold (1.11) works fine.
Comment 54 Markus Trippelsdorf 2011-03-09 14:40:25 UTC
Turned out that GNU ld doesn't like "--as-needed";
LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine.
(although GNU ld uses way more memory than gold.)
Comment 55 Jan Hubicka 2011-03-09 19:17:26 UTC
> Just a warning: Building a -fprofile-generate libxul uses
> ~13GB of memory. (I have 8GB on my build-system and lto1
> got killed several times by the OOM killer, until I added
> enough swap space.)
> The build process still fails later on as described in Comment 28.

You can build -fprofile-generate without -flto and use -flto only for the final build.
It produces the same results and saves _a lot_ of memory ;)

Honza
Comment 56 Jan Hubicka 2011-03-09 19:19:53 UTC
> Turned out that GNU ld doesn't like "--as-needed";
> LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine.
> (although GNU ld uses way more memory than gold.)

Hmm, seems like a GNU ld bug to me (though I never used --as-needed).
Could you file it, please?

Honza
Comment 57 Markus Trippelsdorf 2011-03-09 19:49:56 UTC
(In reply to comment #56)
> > Turned out that GNU ld doesn't like "--as-needed";
> > LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine.
> > (although GNU ld uses way more memory than gold.)
> 
> Hmm, seems like a GNU ld bug to me (though I never used --as-needed).
> Could you file it, please?

Done: http://sourceware.org/bugzilla/show_bug.cgi?id=12557

>You can build -fprofile-generate without -flto and use -flto only for final
>build.

How do you do this with "make -f client.mk profiledbuild"?
Or do you run both phases by hand?
Comment 58 Markus Trippelsdorf 2011-03-09 21:45:10 UTC
> How do you do this with "make -f client.mk profiledbuild"?

To answer my own question:
Just edit ./configure and ./js/src/configure and add
"-flto=4 -fwhole-program" (or whatever you may prefer)
to the PROFILE_USE_CFLAGS variable.
Then you can build Firefox with "make -f client.mk profiledbuild".

BTW libmozsqlite3.so still gets miscompiled, but Firefox is 
now snappy as never before ;-)
Comment 59 Jan Hubicka 2011-03-10 12:53:58 UTC
> > How do you do this with "make -f client.mk profiledbuild"?
> 
> To answer my own question:
> Just edit ./configure and ./js/src/configure and add
> "-flto=4 -fwhole-program" (or whatever you may prefer)
> to the PROFILE_USE_CFLAGS variable.
> Then you can build Firefox with "make -f client.mk profiledbuild".

I did not know of the existence of profiledbuild and thus I did that by hand
where it was easy.
Perhaps the Mozilla build machinery can be told to add -fno-lto to the -fprofile-generate
run.  Hmm, in fact perhaps GCC should do that by default. Not sure if it is not too
late for 4.6, however.
> 
> BTW libmozsqlite3.so still gets miscompiled, but Firefox is 
> now snappy as never before ;-)

Yes, there is a PR on this, but I have absolutely no idea if it is an sqlite or a GCC bug.
Any help is greatly appreciated; sqlite is a big blob of magic for me.
Comment 60 Markus Trippelsdorf 2011-03-23 13:10:50 UTC
Latest mozilla-central fails here:

make[5]: Entering directory `/var/tmp/mozilla-central/moz-build-dir/js/src/shell'
js.cpp
c++ -o js.o -c  -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/nsprpub  -I/usr/include/nspr    -fPIC  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -g -O3   -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/js.pp /var/tmp/mozilla-central/js/src/shell/js.cpp
jsworkers.cpp
c++ -o jsworkers.o -c  -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/nsprpub  -I/usr/include/nspr    -fPIC  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -g -O3   -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/jsworkers.pp /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp
In file included from /var/tmp/mozilla-central/js/src/shell/js.cpp:97:0:
/var/tmp/mozilla-central/js/src/jsobjinlines.h: In member function ‘void JSObject::setArrayLength(uint32)’:
/var/tmp/mozilla-central/js/src/jsobjinlines.h:316:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
/usr/bin/python2.7 /var/tmp/mozilla-central/js/src/config/pythonpath.py -I../config /var/tmp/mozilla-central/js/src/config/expandlibs_exec.py --uselist --  c++ -o js  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -g -O3  js.o jsworkers.o   -lpthread -Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory   -Wl,-rpath-link,/bin -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/lib  -L../../../dist/bin -L../../../dist/lib -Wl,-R/usr/lib64 -L/usr/lib64 -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl
lto1: internal compiler error: in output_die, at dwarf2out.c:11355
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make[6]: *** [/tmp/ccC5KSYt.ltrans18.ltrans.o] Error 1
make[6]: *** Waiting for unfinished jobs....
lto-wrapper: make returned 2 exit status
/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld: fatal error: lto-wrapper failed
collect2: ld returned 1 exit status
Comment 61 Jan Hubicka 2011-04-03 08:34:01 UTC
My tree still builds (this is a debug info ICE and I build without debug info by default).  I will update the tree and try to reproduce it.  It would be handy to have a testcase.
Comment 62 Jan Hubicka 2011-04-03 08:36:47 UTC
And since it doesn't fail at link time, this is a debug info bug, not an LTO one, so if you get a testcase, please open a new PR.
Comment 63 Jan Hubicka 2011-04-03 09:09:03 UTC
Some stats on size of the compilation unit...
There is 4.5GB of GGC memory, it gets down to 3.9MB after type merging and 3.1MB after cgraph merging.


GIMPLE type table: size 524287, 374001 elements, 4447259 searches, 70070870 collisions (ratio: 15.755968)
GIMPLE type hash table: size 8388593, 3907773 elements, 325621199 searches, 247539125 collisions (ratio: 0.760206)
GIMPLE canonical type table: size 262139, 182719 elements, 655793 searches, 1461075 collisions (ratio: 2.227952)
GIMPLE canonical type hash table: size 2097143, 863737 elements, 30341039 searches, 17653238 collisions (ratio: 0.581827)
GIMPLE type comparison table: size 134217689, 70698639 elements, 153291912 searches, 154719852 collisions (ratio: 1.009315)
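
The "ratio" printed in the statistics above appears to be collisions divided by searches. The following minimal sketch recomputes it under that assumption and also derives a load factor (elements / size), which the dump itself does not print. The dictionary layout and the function names are chosen for illustration only.

```python
# Figures copied from the hash table statistics above:
# name -> (size, elements, searches, collisions)
tables = {
    "type table": (524287, 374001, 4447259, 70070870),
    "type hash table": (8388593, 3907773, 325621199, 247539125),
    "canonical type table": (262139, 182719, 655793, 1461075),
    "canonical type hash table": (2097143, 863737, 30341039, 17653238),
    "type comparison table": (134217689, 70698639, 153291912, 154719852),
}

def collision_ratio(searches, collisions):
    # Assumed meaning of the printed "ratio": average collisions per search.
    return collisions / searches

def load_factor(size, elements):
    # Fraction of slots occupied; derived here, not printed by the dump.
    return elements / size

for name, (size, elements, searches, collisions) in tables.items():
    print(f"{name:27s} load {load_factor(size, elements):.2f}  "
          f"collisions/search {collision_ratio(searches, collisions):.6f}")
```

Read this way, the GIMPLE type table stands out with roughly 16 collisions per search, while the other tables stay near or below about 2.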

[WPA] # of input files: 2721
[WPA] # of input cgraph nodes: 127466
[WPA] # of function bodies: 0
[WPA] GIMPLE type table: size 16381, 55 elements, 55 searches, 2 collisions (ratio: 0.036364)

There are 600K cgraph nodes overall before merging; 127K of those have function bodies.

MMAP pool
[WPA] Compression: 680146043 input bytes, 2436118544 uncompressed bytes (ratio: 3.581758)
[WPA] Size of mmap'd section decls: 421187330 bytes
[WPA] Size of mmap'd section function_body: 232170973 bytes
[WPA] Size of mmap'd section statics: 9978045 bytes
[WPA] Size of mmap'd section cgraph: 6356885 bytes
[WPA] Size of mmap'd section vars: 225276 bytes
[WPA] Size of mmap'd section refs: 1082929 bytes
[WPA] Size of mmap'd section jmpfuncs: 8401591 bytes
[WPA] Size of mmap'd section pureconst: 743014 bytes
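
As a quick cross-check of the numbers above, the sketch below sums the per-section sizes, compares the sum with the reported input bytes, and prints each section's share. It uses only the figures quoted in this comment; the variable names are illustrative.

```python
# Section sizes (bytes) copied from the "[WPA] Size of mmap'd section" lines.
sections = {
    "decls": 421187330,
    "function_body": 232170973,
    "statics": 9978045,
    "cgraph": 6356885,
    "vars": 225276,
    "refs": 1082929,
    "jmpfuncs": 8401591,
    "pureconst": 743014,
}
input_bytes = 680146043         # "[WPA] Compression: ... input bytes"
uncompressed_bytes = 2436118544  # "... uncompressed bytes"

total = sum(sections.values())
# The per-section sizes add up to the reported compressed input size.
print("sections total:", total, "matches input:", total == input_bytes)
print("compression ratio: %.6f" % (uncompressed_bytes / input_bytes))

# Largest sections first, as a share of all mmap'd section bytes.
for name, size in sorted(sections.items(), key=lambda kv: kv[1], reverse=True):
    print("%-14s %5.1f%%" % (name, 100.0 * size / total))
```

By this arithmetic, decls and function bodies together account for roughly 96% of the mmap'd section bytes.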
Comment 64 Jan Hubicka 2011-04-03 10:08:34 UTC
Some detailed stats on WPA memory usage.
Before IPA:



ipa-prop.c:2820 (ipa_read_node_info)                      0: 0.0%    8895232: 1.1%   24998944: 0.7%     395040: 0.1%     558297
tree.c:5898 (decl_priority_info)                   12295536: 0.7%          0: 0.0%   27391696: 0.8%          0: 0.0%    2480452
tree.c:1567 (build_string)                         16376223: 0.9%          0: 0.0%   39728388: 1.2%    4876275: 1.1%    1227602
lto-section-in.c:435 (lto_new_in_decl_state)           2280: 0.0%          0: 0.0%   44349120: 1.3%          0: 0.0%     369595
ipa-ref.c:54 (ipa_record_reference)                       0: 0.0%  117135752:14.1%   45299512: 1.3%   38560128: 8.5%     488972
lto-streamer-in.c:1875 (lto_materialize_tree)      44134352: 2.5%          0: 0.0%   66615480: 1.9%       4264: 0.0%    1107669
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw       1480: 0.0%  250512784:30.1%   67551704: 2.0%     157632: 0.0%       7072
cgraph.c:1015 (cgraph_create_edge_1)                      0: 0.0%          0: 0.0%   68064464: 2.0%          0: 0.0%     654466
lto-streamer-in.c:2307 (lto_input_ts_constructor   33062632: 1.9%  111658560:13.4%  102441008: 3.0%   56848328:12.6%     486571
lto/lto.c:214 (lto_read_in_decl_state)                 2288: 0.0%          0: 0.0%  110826912: 3.2%   21320304: 4.7%    2587165
tree.c:1257 (build_int_cst_wide)                  143425600: 8.1%          0: 0.0%  199678728: 5.8%  113095664:25.0%      60257
cgraph.c:459 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  236635872: 6.9%          0: 0.0%     672261
toplev.c:1027 (realloc_for_line_map)                      0: 0.0%  335593472:40.4%  335550464: 9.8%  134297600:29.7%         15
lto-streamer-in.c:1881 (lto_materialize_tree)    1302081688:73.2%          0: 0.0% 1968493840:57.3%   74550688:16.5%   29259517
Total                                            1777935767        831048528       3436852692        452441891         49428016
source location                                     Garbage            Freed             Leak         Overhead            Times
-------------------------------------------------------

After IPA:
stringpool.c:75 (alloc_node)                              0: 0.0%          0: 0.0%   17709680: 0.5%          0: 0.0%     442742
stringpool.c:58 (stringpool_ggc_alloc)                    0: 0.0%          0: 0.0%   22641304: 0.7%    1646320: 0.3%     442742
tree.c:1297 (build_int_cst_wide)                   10611640: 0.6%          0: 0.0%   21902960: 0.6%          0: 0.0%     812865
tree.c:5898 (decl_priority_info)                   12376576: 0.7%          0: 0.0%   27310672: 0.8%          0: 0.0%    2480453
lto-section-in.c:435 (lto_new_in_decl_state)         162720: 0.0%          0: 0.0%   44188680: 1.3%          0: 0.0%     369595
tree.c:1567 (build_string)                         17659049: 1.0%          0: 0.0%   38445562: 1.1%    4876275: 1.0%    1227602
cgraph.c:1015 (cgraph_create_edge_1)                      0: 0.0%          0: 0.0%   68064464: 2.0%          0: 0.0%     654466
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw      26888: 0.0%  258338128:27.6%   75336800: 2.2%     171272: 0.0%       7667
gimple.c:4187 (iterative_hash_gimple_type)         78311648: 4.3%          0: 0.0%     260960: 0.0%          0: 0.0%    4910788
ipa-ref.c:54 (ipa_record_reference)                       0: 0.0%  156312592:16.7%   82529352: 2.4%   63464176:13.2%     506799
lto-streamer-in.c:1875 (lto_materialize_tree)      49735872: 2.8%          0: 0.0%   61013960: 1.8%       4264: 0.0%    1107669
lto/lto.c:214 (lto_read_in_decl_state)               315616: 0.0%          0: 0.0%  110513584: 3.2%   21320304: 4.4%    2587165
lto-symtab.c:156 (lto_symtab_register_decl)       130991616: 7.3%          0: 0.0%    2900408: 0.1%          0: 0.0%    2390929
lto-streamer-in.c:2307 (lto_input_ts_constructor   33062632: 1.8%  111658560:12.0%  102441008: 3.0%   56848328:11.8%     486571
cgraph.c:459 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  236635872: 6.9%          0: 0.0%     672261
toplev.c:1027 (realloc_for_line_map)                      0: 0.0%  335593472:35.9%  335550464: 9.8%  134297600:28.0%         15
tree.c:1257 (build_int_cst_wide)                  144244592: 8.0%          0: 0.0%  198866208: 5.8%  113097680:23.5%      60267
lto-streamer-in.c:1881 (lto_materialize_tree)    1319860448:73.1%          0: 0.0% 1950715080:57.0%   74550688:15.5%   29259517
Total                                            1804556313        934357752       3423228826        480284459         49853300
source location                                     Garbage            Freed             Leak         Overhead            Times

Kind                   Nodes      Bytes
---------------------------------------
decls                11502734 1829746088
types                4430124  744260832
blocks                     1         88
stmts                      0          0
refs                    8173     485872
exprs                2358594  113315792
constants            2245230   86809013
identifiers           442742   17709680
vecs                   60267  116915440
binfos               1107669  110741304
ssa names                309      27192
constructors          310545    9937440
random kinds         10648367  425935048
lang_decl kinds            0          0
lang_type kinds            0          0
omp clauses                0          0
---------------------------------------
Total                33114755 -839083507
---------------------------------------
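
The negative byte total in the table above is consistent with the sum wrapping around a signed 32-bit counter. That is an inference from the arithmetic, not something stated in this comment. A minimal sketch that reproduces it from the per-kind figures (the helper name wrap_int32 is made up for illustration):

```python
# Byte counts per kind, copied from the table above.
bytes_per_kind = [
    1829746088,  # decls
    744260832,   # types
    88,          # blocks
    0,           # stmts
    485872,      # refs
    113315792,   # exprs
    86809013,    # constants
    17709680,    # identifiers
    116915440,   # vecs
    110741304,   # binfos
    27192,       # ssa names
    9937440,     # constructors
    425935048,   # random kinds
    0, 0, 0,     # lang_decl kinds, lang_type kinds, omp clauses
]

def wrap_int32(value):
    # Reinterpret an integer as a two's-complement signed 32-bit value.
    return (value + 2**31) % 2**32 - 2**31

true_total = sum(bytes_per_kind)
print(true_total)              # exact sum with Python's unbounded integers
print(wrap_int32(true_total))  # what a signed 32-bit counter would show
```

The exact sum is about 3.46GB, which lines up with the "Leak" totals in the allocation tables above.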
Comment 65 Markus Trippelsdorf 2011-04-03 11:32:08 UTC
(In reply to comment #62)
> and since it doesn't fail at link time, this is debug info bug, not LTO, so if
> you get a testcase, please open a new PR.

You're right, it builds fine without "-g" (ac_add_options --disable-debug-symbols).

But the build now fails early when elfhack is enabled:

with gold:
c++ -o elfhack -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth
-Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof
-Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native
-fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing
-fshort-wchar -pthread -pipe -fexceptions  -DNDEBUG -DTRIMMED -g -O3 -lpthread
-Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory  
-Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/bin
-Wl,-rpath-link,/usr/lib  host_elf.o host_elfhack.o
/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld:
/tmp/ccGQbukN.ltrans3.ltrans.o: in function
_ZN8Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc.local.402:/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:239:
error: undefined reference to 'void Elf_Ehdr_Traits::swap<big_endian,
Elf64_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&,
Elf64_Ehdr&)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld:
/tmp/ccGQbukN.ltrans3.ltrans.o: in function
_ZN8Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc.local.402:/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:228:
error: undefined reference to 'void Elf_Ehdr_Traits::swap<big_endian,
Elf32_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&,
Elf32_Ehdr&)'
collect2: ld returned 1 exit status
make[7]: *** [elfhack] Error 1

or with gnu-ld:
In function `serialize':
/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:239: undefined reference to
`void Elf_Ehdr_Traits::swap<big_endian, Elf64_Ehdr,
serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, Elf64_Ehdr&)'
/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:228: undefined reference to
`void Elf_Ehdr_Traits::swap<big_endian, Elf32_Ehdr,
serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, Elf32_Ehdr&)'
collect2: ld returned 1 exit status

see also: https://bugzilla.mozilla.org/show_bug.cgi?id=647458
(but it does look more like a gcc lto bug to me)
Comment 66 froydnj@codesourcery.com 2011-04-04 01:18:59 UTC
On Sun, Apr 03, 2011 at 10:09:06AM +0000, hubicka at gcc dot gnu.org wrote:
> Kind                   Nodes      Bytes
> ---------------------------------------
> decls                11502734 1829746088
> types                4430124  744260832
> blocks                     1         88
> stmts                      0          0
> refs                    8173     485872
> exprs                2358594  113315792
> constants            2245230   86809013
> identifiers           442742   17709680
> vecs                   60267  116915440
> binfos               1107669  110741304
> ssa names                309      27192
> constructors          310545    9937440
> random kinds         10648367  425935048
> lang_decl kinds            0          0
> lang_type kinds            0          0
> omp clauses                0          0
> ---------------------------------------
> Total                33114755 -839083507
> ---------------------------------------

Do folks think it would be useful to include a breakdown by individual
TREE_CODE, similar to what's done for RTXes?
Comment 67 Richard Biener 2011-04-04 12:30:07 UTC
(In reply to comment #66)
> On Sun, Apr 03, 2011 at 10:09:06AM +0000, hubicka at gcc dot gnu.org wrote:
> > Kind                   Nodes      Bytes
> > ---------------------------------------
> > decls                11502734 1829746088
> > types                4430124  744260832
> > blocks                     1         88
> > stmts                      0          0
> > refs                    8173     485872
> > exprs                2358594  113315792
> > constants            2245230   86809013
> > identifiers           442742   17709680
> > vecs                   60267  116915440
> > binfos               1107669  110741304
> > ssa names                309      27192
> > constructors          310545    9937440
> > random kinds         10648367  425935048
> > lang_decl kinds            0          0
> > lang_type kinds            0          0
> > omp clauses                0          0
> > ---------------------------------------
> > Total                33114755 -839083507
> > ---------------------------------------
> 
> Do folks think it would be useful to include a breakdown by individual
> TREE_CODE, similar to what's done for RTXes?

I have posted a patch for this last year, but it seems I forgot to commit
it.
Comment 68 froydnj@codesourcery.com 2011-04-04 13:13:01 UTC
On Mon, Apr 04, 2011 at 01:01:27PM +0000, rguenth at gcc dot gnu.org wrote:
> > Do folks think it would be useful to include a breakdown by individual
> > TREE_CODE, similar to what's done for RTXes?
> 
> I have posted a patch for this last year, but it seems I forgot to commit
> it.

Well, it'd be most interesting to see the per-code breakdown for Honza's
earlier numbers.
Comment 69 Mark Mitchell 2011-04-05 00:16:02 UTC
On 4/4/2011 3:19 AM, froydnj at codesourcery dot com wrote:

> Do folks think it would be useful to include a breakdown by individual
> TREE_CODE, similar to what's done for RTXes?

Sure couldn't hurt, and I can definitely think of situations where I
wanted exactly that.

Thank you,
Comment 70 Jan Hubicka 2011-04-07 19:15:19 UTC
I cannot reproduce the aforementioned elfhack failure. For me the build fails later, at:
/abuild/jh/trunk-install/bin/g++ -flto=24 -fuse-linker-plugin -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -fno-strict-aliasing -fshort-wchar -pthread -pipe -fexceptions  -DNDEBUG -DTRIMMED -g -Os -freorder-blocks -fomit-frame-pointer  -fPIC -shared -Wl,-z,defs -Wl,-h,test.so -o test.so test.o
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
/abuild/jh/build-mozilla-new11-lto-elfhack/build/unix/elfhack/elfhack -b test.so
test.so: terminate called after throwing an instance of 'std::runtime_error'
  what():  Section index out of bounds
make[5]: *** [test.so] Aborted (core dumped)

I tend to believe that this is an elfhack problem.  The only way for me to get a similar linker error is to disable the linker plugin and use -fwhole-program.
Could you please try to build with -save-temps -fdump-ipa-cgraph and attach the produced *.res and *wpa*cgraph files?
Comment 71 Markus Trippelsdorf 2011-04-07 19:38:17 UTC
Created attachment 23917 [details]
-lm.res
Comment 72 Markus Trippelsdorf 2011-04-07 19:39:29 UTC
Created attachment 23918 [details]
elfhack.wpa.000i.cgraph
Comment 73 Markus Trippelsdorf 2011-04-07 19:59:30 UTC
Jan,
elfhack only fails to build if I use:
ac_add_options --enable-optimize=-O3
in my .mozconfig.
When I delete the =-O3 part everything builds fine.
Comment 74 Jan Hubicka 2011-04-07 22:07:38 UTC
Interesting. -O3 makes no difference for me.  I will look into your dumps to see if I can spot something useful.

The behavior I observe is that GCC optimizes away all the strings that are placed into test.so. I didn't look deeper into it (right now I am checking whether I can reproduce your dwarf2out ICE and get a testcase), but I think it is what makes my elfhack test fail. I am surprised it does not happen for you.

If GCC fails to link even such a simple program as elfhack, something pretty fundamental must be going wrong.  Perhaps it is a linker bug. I had problems with older versions of gold.
Comment 75 Markus Trippelsdorf 2011-04-08 06:52:34 UTC
(In reply to comment #74)
> Interesting. -O3 makes no difference for me.  I will look into your dumps if I
> can spot something useful.
> ...
> If GCC fail to link even such a simple program as elfhack is, something pretty
> fundamental must go wrong.  Perhaps it is linker bug. I had problems with older
> versions of gold.

The failure only happens with -flto.
And the reason is that:

c++ -o host_elf.o -c -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fno-strict-aliasing -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -Os -I/var/tmp/mozilla-central/build/unix/elfhack -I. -I../../../dist/include -I../../../dist/include/nsprpub -I/usr/include/nspr -I/usr/include/nss -I/usr/include/nspr /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp

apparently only compiles correctly in the -Os case. All other optimization switches (-O(0..3) or without -O) lead to the eventual link failure above.
And it happens with both gnu-ld and gold (2.21.51.20110402).
Comment 76 Markus Trippelsdorf 2011-04-08 15:42:23 UTC
Created attachment 23930 [details]
Output of  -Wl,-Map good
Comment 77 Markus Trippelsdorf 2011-04-08 15:51:09 UTC
Created attachment 23931 [details]
Output of  -Wl,-Map bad

I've attached the output of "-Wl,-Map,map" of both
cases (-Os vs. -O2). 
Please do a vimdiff of both and search for
Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc
and you'll see that in the good case it lives in its own ltrans file:
/tmp/cca0jnrX.ltrans9.ltrans.o
while in the bad case it is thrown together with other headers into:
/tmp/ccd8WyNK.ltrans3.ltrans.o 
which then leads to the link error above.
Comment 78 Mike Hommey 2011-04-08 15:57:14 UTC
(In reply to comment #75)
> (In reply to comment #74)
> > Interesting. -O3 makes no difference for me.  I will look into your dumps if I
> > can spot something useful.
> > ...
> > If GCC fail to link even such a simple program as elfhack is, something pretty
> > fundamental must go wrong.  Perhaps it is linker bug. I had problems with older
> > versions of gold.
> 
> The failure only happens with -flto.
> And the reason is that:
> 
> c++ -o host_elf.o -c -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual
> -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align
> -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long
> -march=native -fpermissive -flto=4 -fuse-linker-plugin -fno-strict-aliasing
> -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -Os
> -I/var/tmp/mozilla-central/build/unix/elfhack -I. -I../../../dist/include
> -I../../../dist/include/nsprpub -I/usr/include/nspr -I/usr/include/nss
> -I/usr/include/nspr /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp
> 
> apparently only compiles correctly in the -Os case. All other optimization
> switches (-O(0..3) or without -O) lead to the eventual link failure above.
> And it happens with both gnu-ld and gold (2.21.51.20110402).

What matters is what is used to build/link test.so, not elfhack itself, and from the look at the command line in comment 70, you're building test.so with unexpected things. It is not meant to be optimized. So, some more variables tweaking would apparently be required in build/unix/elfhack/Makefile.in.
Comment 79 Markus Trippelsdorf 2011-04-08 16:10:01 UTC
(In reply to comment #78)

> What matters is what is used to build/link test.so, not elfhack itself, and
> from the look at the command line in comment 70, you're building test.so with
> unexpected things. It is not meant to be optimized. So, some more variables
> tweaking would apparently be required in build/unix/elfhack/Makefile.in.

There are two different issues that we're talking about:

-The link error when you build with --enable-optimize=-O3
This has nothing to do with test.so AFAICS.

-The test failure Jan reported, which only happens _after_ elfhack
is successfully built. And in this case your comment above may apply.
Comment 80 Jan Hubicka 2011-04-11 11:00:05 UTC
Hi,
in the resolution files, the swap functions are already undefined:

5382 3d06433b UNDEF __assert_fail
5400 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI13little_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_
5447 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_
5455 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI13little_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_
5459 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_

I currently have problems getting past the firewall to my Mozilla build, but this seems like another instance of the problem with COMDATs, i.e. host_elfhack includes some header that uses those functions in something that is inlined and consistently optimized out in normal compilation, but due to COMDAT issues it stays stuck in the LTO output.

According to cgraph dump it is used by
Comment 81 Jan Hubicka 2011-04-11 11:13:32 UTC
Sorry, Firefox concluded I wanted to save changes when I didn't ;)

The problem is the function Elf_Ehdr::serialize(std::basic_ofstream<char, std::char_traits<char> >&, char, char)

What I see is that this function is defined several times in the unmerged cgraph (i.e. it is a COMDAT inline coming from different .o files), and _some_ of the definitions call a swap function that is not defined, while other definitions call a swap function that is defined.

In your build the one that calls the undefined swap wins, resulting in the final link error.  I am not sure if this is a GCC bug or an elfhack bug, but I would guess elfhack, actually.

This is all a bit tricky since the COMDAT hack comes into play here: GCC does not tell the linker about COMDATs for inline functions in the LTO symtab when their address is not taken, since they should be defined in every unit that needs them.
That is not the case here.

I think either swap should be keyed in one of the units, which it apparently is not:
swap/622(-1) @0x7f4d15d34000 (asm: _ZN18Elf_RelHack_Traits4swapI10big_endian9Elf32_Rel12serializableIS_EEEvRT1_RT0_) analyzed 19 time, 16 benefit 31 size, 8 benefit externally_visible finalized inlinable
  called by: serialize/623 (0.65 per call) serialize/623 (0.25 per call)
  calls: __builtin_constant_p/466 (1.00 per call) __builtin_constant_p/466 (1.00 per call)
  References:
  Refering this function:

(other defs look like)

swap/825(-1) @0x7f4d15d466e0 (asm: _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_) undef
  called by: serialize/595 (0.25 per call) (can throw external)
  calls:
  References:
  Refering this function:

Or this is some source-level bug, i.e. one unit just forward-declares the function while another defines it as a COMDAT inline, which is probably a violation of the one definition rule.

Would it be possible for you to look into the preprocessed source files of elfhack and see which units define the serialize function and, among those, what the swap definitions look like?

We could probably make lto-symtab not give up on seeing an Undef resolution from the linker in these cases, but I would rather avoid piling up hacks around this COMDAT mess.

Honza
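
The situation described above can be pictured with a toy model. This is only a sketch of the reasoning, not how lto-symtab or any linker is implemented. The unit names, the dictionary layout, and the helper undefined_after_link are all hypothetical. Each unit carries its own copy of the COMDAT inline serialize; one copy had swap inlined, the other still references swap as an external symbol.

```python
# Toy model only; it does not reproduce GCC's lto-symtab or any linker.
# Both units carry a copy of the COMDAT inline function "serialize".
# "external_calls": symbols that copy still references after the unit's own inlining.
# "emitted": out-of-line symbols the unit provides.
units = {
    "unit_with_swap_body": {          # swap body was visible and inlined here
        "external_calls": set(),
        "emitted": {"serialize"},
    },
    "unit_with_declaration_only": {   # swap was only declared here
        "external_calls": {"swap"},
        "emitted": {"serialize"},
    },
}

def undefined_after_link(units, prevailing):
    """Return symbols left unresolved when `prevailing`'s copy of serialize wins."""
    provided = set()
    for unit in units.values():
        provided |= unit["emitted"]
    return units[prevailing]["external_calls"] - provided

for name in units:
    missing = undefined_after_link(units, name)
    print(name, "->", sorted(missing) if missing else "links")
```

In this model the link only fails when the copy with the external reference prevails. Any change that adds swap to some unit's emitted set, or removes it from every copy's external_calls, makes both choices link.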
Comment 82 Markus Trippelsdorf 2011-04-11 15:08:28 UTC
(In reply to comment #81)
> 
> The problem is function Elf_Ehdr::serialize(std::basic_ofstream<char,
> std::char_traits<char> >&, char, char)
... 
> Would be possible for you to look into preprocessed source files of elfhack and
> see what units define the serialize function and among those how the swap
> definitions look like?
> 

I think it would be best if you take a look at the source files 
yourself once your firewall problem is solved, because there
are actually only two of them (elfxx.h and elf.cpp).

The instantiation takes place in elfxx.h:431 and elf.cpp:142.

BTW when I use -frepo to compile host_elf.o the link error goes away.

And if I recompile host_elf.o without -frepo, but leave the host_elf.rpo
file, this is what happens:

collect: recompiling /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp
collect: relinking
collect2: '_ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_' was assigned to 'host_elf.rpo', but was not defined during recompilation, or vice versa
and then the link error from above follows.
Comment 83 Markus Trippelsdorf 2011-04-11 18:44:07 UTC
> I am not sure if this is GCC bug or elfhack, but I would guess for
> elfhack actually.

I guess you're right, because when I move the swap definitions:

template <class endian, typename R, typename T>
inline void Elf_Ehdr_Traits::swap(T &t, R &r)
...

from elf.cpp to elfxx.h (where they actually belong) the 
link error vanishes.
Comment 84 Mike Hommey 2011-04-12 10:53:44 UTC
(In reply to comment #83)
> > I am not sure if this is GCC bug or elfhack, but I would guess for
> elfhack actually.
> 
> I guess you're right, because when I move the swap definitions:
> 
> template <class endian, typename R, typename T>
> inline void Elf_Ehdr_Traits::swap(T &t, R &r)
> ...
> 
> from elf.cpp to elfxx.h (where they actually belong) the 
> link error vanishes.

I'm not convinced they belong there. But wouldn't removing the "inline" keyword work equally well?
Comment 85 Jan Hubicka 2011-04-12 16:22:13 UTC
Thanks for the analysis. Removing inline should work too.
While, as a QoI issue, GCC could find the missing body, I think it is better to avoid more hacks. For 4.7 I will implement the new COMDAT proposal.
Does elfhack work for you now?
Comment 86 Markus Trippelsdorf 2011-04-12 16:42:34 UTC
(In reply to comment #85)
> does elfhack work for you now?

Yes, no problems anymore.
Comment 87 Jan Hubicka 2011-04-22 12:52:17 UTC
http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01854.html has updated build time/memory stats. With Michael's WPA patch, we now need about 5GB of address space on a 64-bit build, so we might fit in 32-bit again.
Comment 88 Jan Hubicka 2011-04-22 15:03:16 UTC
As a quick status update, mozilla now builds and works with TOT GCC tree again, after fixes to debug info streaming and clone materialization.  -g still fails at PR48724
Comment 89 Jan Hubicka 2011-05-02 10:13:00 UTC
This is a callgrind profile for our hash tables, which consume most of the time at the WPA stage.  It is from the JavaScript library, but probably close enough to libxul:

    9,413,074  < ipa.c:cgraph_node_set_add (47698x) 
  237,777,114  < lto-streamer-in.c:lto_input_location (253470x) 
      162,391  < cgraph.c:cgraph_same_body_alias_1 (1125x) 
    3,481,459  < lto/lto.c:lto_create_files_from_ids (18272x) 
1,262,433,061  < lto-streamer.c:lto_streamer_cache_insert_1 (9456405x) 
    1,721,939  < cgraph.c:cgraph_remove_node (13507x) 
   32,443,118  < cgraph.c:cgraph_get_node (254257x) 
   15,700,040  < lto/lto.c:remember_with_vars (88495x) 
  100,462,329  < lto-streamer.c:lto_streamer_cache_lookup (959530x) 
   59,948,506  < lto/lto-object.c:lto_obj_add_section (38584x) 
  551,876,527  < gimple.c:gimple_register_type'2 (9863x) 
   15,332,148  < lto-symtab.c:lto_symtab_get (148180x) 
  123,454,996  < ipa.c:varpool_node_set_find (1090522x) 
  497,594,354  < gimple.c:gimple_register_canonical_type (174920x) 
    7,723,287  < lto-section-out.c:lto_output_decl_index (48869x) 
    1,363,423  < lto-section-in.c:lto_get_function_in_decl_state (13102x) 
   60,607,732  < ipa.c:cgraph_node_set_find (526286x) 
    3,220,597  < varpool.c:varpool_node (19821x) 
    3,316,861  < lto-symtab.c:lto_symtab_register_decl (23462x) 
  523,758,152  < lto-streamer-out.c:lto_output_string_with_length (793000x) 
   30,909,893  < lto/lto.c:create_subid_section_table (19190x) 
    4,593,607  < cgraph.c:cgraph_create_node (22343x) 
      223,259  < cgraph.c:cgraph_clone_node (1353x) 
   20,940,173  < lto-section-in.c:lto_record_renamed_decl (14960x) 
2,983,016,896  < gimple.c:gimple_register_type (149596x) 
    3,876,333  < cgraph.c:cgraph_get_node_or_alias (27793x) 
      123,200  < varpool.c:varpool_remove_node (973x) 
   46,083,990  < tree.c:build_int_cst_wide (247788x) 
    4,703,171  < ipa.c:cgraph_node_set_remove (40839x) 
  261,240,516  * libiberty/hashtab.c:htab_find_slo
So it seems that in addition to type merging we have quite a few other problems.  varpool_node_set_find seems just stupid, for example.
Comment 90 Jan Hubicka 2011-05-02 12:41:15 UTC
Per node memory usage statistics for WPA
Code                   Nodes
----------------------------
identifier_node       428715
tree_list            10992455
tree_vec               54594
enumeral_type          49860
integer_type          201079
real_type               1975
pointer_type         1575376
reference_type        102944
array_type             98085
record_type           903172
union_type             17170
void_type               1496
function_type         127906
method_type          1533898
integer_cst           767153
real_cst               15992
string_cst           1224809
function_decl        2473011
label_decl            264118
field_decl           1399608
var_decl               86596
const_decl            510913
parm_decl            5530790
type_decl             964008
result_decl           553028
debug_expr_decl       144282
namespace_decl          9876
constructor           160380
nop_expr              508605
addr_expr             789320
tree_binfo           1090674
Comment 91 Jan Hubicka 2011-05-03 17:34:56 UTC
Hi,
with the patch I just posted for removal of hash tables for cgraph/varpool node sets, the situation with hashing is better. The WPA stage went from 900s to 500s.

Streaming still dominates:
 ipa lto decl in       : 331.26 (56%) usr   5.51 (34%) sys 337.11 (56%) wall  722314 kB (46%) ggc
 ipa lto decl out      : 118.21 (20%) usr   4.37 (27%) sys 122.57 (20%) wall       0 kB ( 0%) ggc
 ipa lto decl merge    :  23.61 ( 4%) usr   0.20 ( 1%) sys  23.83 ( 4%) wall     962 kB ( 0%) ggc
 inline heuristics     :  57.12 (10%) usr   0.14 ( 1%) sys  57.27 ( 9%) wall  227500 kB (14%) ggc
 TOTAL                 : 587.02            16.36           604.01            1585790 kB
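
To put the time report above in proportion, this small sketch adds up the two streaming phases and expresses each listed phase as a share of the reported total user time. Only the usr column is used, and the names are illustrative.

```python
# usr seconds per phase, copied from the time report above.
usr_seconds = {
    "ipa lto decl in": 331.26,
    "ipa lto decl out": 118.21,
    "ipa lto decl merge": 23.61,
    "inline heuristics": 57.12,
}
total_usr = 587.02  # usr column of the TOTAL line

# Streaming = reading declarations in plus writing them back out.
streaming = usr_seconds["ipa lto decl in"] + usr_seconds["ipa lto decl out"]
print("streaming share of usr time: %.1f%%" % (100.0 * streaming / total_usr))

for phase, secs in usr_seconds.items():
    print("%-20s %5.1f%%" % (phase, 100.0 * secs / total_usr))
```

Streaming in and out together come to a bit over three quarters of the reported user time.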

(I have plans for fixing inliner once more prominent problems are solved)
Streaming in oprofile:
150985   20.6876  lto1                     htab_find_slot_with_hash
71532     9.8012  lto1                     gimple_types_compatible_p
55971     7.6690  libc-2.11.1.so           _int_malloc
55104     7.5502  lto1                     iterative_hash_hashval_t
33160     4.5435  lto1                     type_pair_eq
31554     4.3235  libc-2.11.1.so           memset
25670     3.5172  lto1                     gtc_visit
23972     3.2846  lto1                     gimple_type_hash_1
21562     2.9544  lto1                     lto_input_tree
15230     2.0868  lto1                     gt_ggc_mx_lang_tree_node
14807     2.0288  lto1                     inflate_fast

callgrind profile (of javascript instead of libxul) shows that tree_map_base hash is the most busy one:

  453,603,428  *   libiberty/../../libiberty/hashtab.c:htab_find_slot_with_hash'2 
   33,167,620  >   gcc/../../gcc/tree.c:tree_map_base_eq (6633524x) 
  134,245,948  >   libiberty/../../libiberty/hashtab.c:htab_expand (18x) 
   25,459,797  >   gcc/../../gcc/gimple.c:type_pair_eq (2793149x)

and the users of hashing:
   63,519,720  < /libiberty/hashtab.c:htab_find_slot'2 (676308x)
3,975,492,482  < /libiberty/hashtab.c:htab_find_slot (2179693x)
  255,072,048  *  /libiberty/hashtab.c:htab_find_slot_with_hash

   14,530,222  < /gcc/gimple.c:iterative_hash_gimple_type'2 (52634x)
  526,622,873  < /gcc/gimple.c:lookup_type_pair.isra.103.constprop.111 (1621144x)
   17,415,611  < /gcc/gimple.c:iterative_hash_gimple_type (100893x)
   11,734,620  < /gcc/gimple.c:visit'2 (98730x)
  432,531,796  < /gcc/gimple.c:gimple_type_hash_1 (3851023x)
   35,405,473  < /gcc/gimple.c:visit (319520x)
  108,790,992  *  /libiberty/hashtab.c:htab_find_slot'2


Oprofile of the whole build also shows problems in decl_assembler_name_equal (because of our stupid alias hacks) and can_inline_edge_p.  I will look into those two.
260739    7.1750  lto1                     lto1                     htab_find_slot_with_hash
151080    4.1574  lto1                     lto1                     decl_assembler_name_equal
130969    3.6040  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
100723    2.7717  lto1                     lto1                     gimple_types_compatible_p
97370     2.6794  lto1                     lto1                     iterative_hash_hashval_t
75051     2.0653  libc-2.11.1.so           libc-2.11.1.so           memset
56508     1.5550  lto1                     lto1                     bitmap_set_bit
53211     1.4643  lto1                     lto1                     can_inline_edge_p
51613     1.4203  oprofiled                oprofiled                /usr/bin/oprofiled
49992     1.3757  lto1                     lto1                     pointer_map_insert
48381     1.3313  lto1                     lto1                     lto_input_tree
44467     1.2236  lto1                     lto1                     type_pair_eq
35096     0.9658  libc-2.11.1.so           libc-2.11.1.so           _int_free
35069     0.9650  lto1                     lto1                     gtc_visit

(this includes the ltrans stage)
Honza
Comment 92 Jan Hubicka 2011-05-19 22:28:18 UTC
decl in is now at 96 seconds.
oprofile for streaming in is:
27469     9.3054  lto1                     htab_find_slot_with_hash
23175     7.8508  libc-2.11.1.so           _int_malloc
18044     6.1126  lto1                     lto_input_tree
14823     5.0215  libc-2.11.1.so           memset
14108     4.7792  lto1                     gt_ggc_mx_lang_tree_node
13511     4.5770  lto1                     inflate_fast
11805     3.9991  lto1                     gimple_type_eq
11247     3.8100  lto1                     lto_input_uleb128
11227     3.8033  lto1                     ggc_set_mark
10903     3.6935  lto1                     pointer_map_insert

So there is obviously still some room for improvement in merging. I think the malloc calls come mostly from the SCC detection code (we create a lot of temporary obstacks and pointer maps). lto_input_tree can probably benefit from quite a lot of optimizations that reduce the amount of data we stream. Plus we don't really need to stream ulebs for everything.
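For readers unfamiliar with the "ulebs" mentioned here: ULEB128 is the variable-length unsigned integer encoding (also used by DWARF) that stores 7 payload bits per byte and uses the top bit as a continuation flag. The following is only an illustrative Python sketch, not the lto_input_uleb128 code itself, and the function names are hypothetical. It shows why every value streamed this way costs a byte-by-byte loop when reading.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as ULEB128 bytes."""
    if value < 0:
        raise ValueError("ULEB128 encodes unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 payload bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, continuation bit clear
            return bytes(out)


def decode_uleb128(data, pos=0):
    """Decode one ULEB128 value; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7
```

Small values fit in a single byte, but each one still goes through the loop and the continuation test, which is presumably why fixed-width or bit-packed streaming of some fields could be cheaper.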

For the whole WPA we now need about 5 minutes; the oprofile is:
152067   14.9073  lto1                     decl_assembler_name_equal
48258     4.7308  lto1                     htab_find_slot_with_hash
46730     4.5810  lto1                     edge_badness
37954     3.7207  libc-2.11.1.so           _int_malloc
36692     3.5970  lto1                     pointer_map_insert
30387     2.9789  lto1                     do_estimate_growth
28496     2.7935  lto1                     lto_input_tree
20992     2.0579  lto1                     inflate_fast
20765     2.0356  libc-2.11.1.so           memset
20264     1.9865  lto1                     varpool_node_for_asm
19784     1.9394  lto1                     lto_output_tree
19121     1.8745  lto1                     htab_hash_string
19053     1.8678  lto1                     lto_input_uleb128

The good news is that the decl_assembler_name_equal time comes from the stupid handling of varpool aliases in varpool_node_for_asm, which will go away with my alias rewrite.
edge_badness is easy to track down, too; it is just inliner updating paranoia.

Honza
Comment 93 Jan Hubicka 2011-05-19 22:41:41 UTC
Time report:
 ipa lto gimple out    :  10.28 ( 4%) usr   1.05 (11%) sys  11.35 ( 4%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  98.45 (37%) usr   2.23 (24%) sys 100.91 (36%) wall  713587 kB (45%) ggc
 ipa lto decl out      :  82.47 (31%) usr   2.92 (31%) sys  85.84 (31%) wall       0 kB ( 0%) ggc
 inline heuristics     :  31.74 (12%) usr   0.14 ( 1%) sys  32.07 (11%) wall  240317 kB (15%) ggc
 TOTAL                 : 269.41             9.36           279.78            1595687 kB

GIMPLE type table: size 1048573, 427153 elements, 6361837 searches, 23794591 collisions (ratio: 3.740208)
GIMPLE type hash table: size 4194301, 1452245 elements, 72676685 searches, 47569100 collisions (ratio: 0.654530)
GIMPLE canonical type table: size 65521, 48844 elements, 762160 searches, 552280 collisions (ratio: 0.724625)
GIMPLE canonical type hash table: size 1048573, 402512 elements, 2184661 searches, 1627547 collisions (ratio: 0.744988)

Nice improvement.
My reading is that the GIMPLE type hash table would be better as a TYPE_UID-indexed array (or a pointer map if it could be told to live in GGC).  About 73 million searches is quite a lot.
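The idea of an array indexed by TYPE_UID instead of a hash table can be sketched as follows. This is an illustration in Python, not GCC code: the class and method names are hypothetical, and it assumes UIDs are small, dense, non-negative integers, so that a lookup becomes a bounds check plus an index with no hashing and no collision chain.

```python
class UidIndexedTable:
    """Direct-indexed table keyed by a dense integer UID (hypothetical)."""

    def __init__(self):
        self._slots = []

    def get(self, uid):
        """Return the value stored for uid, or None if nothing is stored."""
        if 0 <= uid < len(self._slots):
            return self._slots[uid]
        return None

    def set(self, uid, value):
        """Store value for uid, growing the backing array on demand."""
        if uid < 0:
            raise ValueError("UIDs are assumed to be non-negative")
        if uid >= len(self._slots):
            self._slots.extend([None] * (uid + 1 - len(self._slots)))
        self._slots[uid] = value


# Example: cache a previously computed hash value per type UID.
cached_hashes = UidIndexedTable()
cached_hashes.set(42, 0x9E3779B9)
```

The trade-off is memory: the array is as long as the largest UID seen, so this only pays off when most UIDs in the range are actually used.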

Honza
Comment 94 Jan Hubicka 2011-05-20 15:30:38 UTC
Callgrinding htab_find_slot_with_hash leads to:

 2,535,276,742  < /libiberty/hashtab.c:htab_find_slot'2 (27545437x) [//lto1]
84,947,655,239  < /libiberty/hashtab.c:htab_find_slot (52919141x) [//lto1]
 7,097,218,396  *  /libiberty/hashtab.c:htab_find_slot_with_hash [//lto1]

   172,769,366  < /gcc/gimple.c:iterative_hash_gimple_type'2 (1062343x) [//lto1]
   172,240,553  < /gcc/gimple.c:iterative_hash_canonical_type'2 (1385651x) [//lto1]
   577,192,890  < /gcc/gimple.c:iterative_hash_gimple_type (3503598x) [//lto1]
   272,475,796  < /gcc/gimple.c:visit'2 (2487924x) [//lto1]
 5,719,882,429  < /gcc/gimple.c:gimple_type_hash (54720792x) [//lto1]
   220,431,173  < /gcc/gimple.c:iterative_hash_canonical_type (1878732x) [//lto1]
 1,049,746,336  < /gcc/gimple.c:visit (10902158x) [//lto1]
 1,366,941,564  *  /libiberty/hashtab.c:htab_find_slot'2 [//lto1]


 1,663,235,593  < /gcc/gimple.c:gimple_register_canonical_type (1841890x) [//lto1]
 9,524,617,674  < /gcc/lto-streamer-in.c:lto_input_location (11940149x) [//lto1]

88,359,773,304  < /gcc/gimple.c:gimple_register_type_1 (6184225x) [//lto1]
   919,314,384  < /gcc/tree.c:build_int_cst_wide (2665535x) [//lto1]
   337,283,088  < /gcc/cgraph.c:cgraph_get_node_or_alias (2410404x) [//lto1]
 1,856,067,526  < /gcc/lto/lto.c:remember_with_vars (10704387x) [//lto1]
   265,696,672  < /gcc/lto-symtab.c:lto_symtab_register_decl (2471602x) [//lto1]
 1,020,331,990  < /gcc/lto-symtab.c:lto_symtab_get (10402341x) [//lto1]
   952,544,538  *  /libiberty/hashtab.c:htab_find_slot [//lto1]

So gimple_type_hash (54 million calls), lto_input_location and remember_with_vars (about 12 million and 11 million calls, respectively) seem to be the major (ab)users of hashing now.

For malloc abuse, the major sources are pointer_map_create (66 million calls), vec_heap_o_reserve_1 (23 million) and obstack_begin (22 million), which leads to...

     
30,424,353,893  < /gcc/gimple.c:gimple_type_eq (18852945x) [//lto1]
 5,578,574,652  < /gcc/gimple.c:gimple_type_hash (3452343x) [//lto1]
   401,735,124  *  /gcc/pointer-set.c:pointer_map_create [//lto1]
Comment 95 Jan Hubicka 2011-05-20 15:31:28 UTC
... and
 7,456,601,134  < /gcc/gimple.c:gimple_type_eq (18852945x) [//lto1]
 1,384,102,312  < /gcc/gimple.c:gimple_type_hash (3452343x) [//lto1]
   936,822,402  *  ???:_obstack_begin [/lib64/libc-2.11.1.so]
Comment 96 Jan Hubicka 2011-05-27 21:57:27 UTC
The stream-in oprofile has now changed quite a bit:

33258     9.6313  lto1                     htab_find_slot_with_hash
29679     8.5949  lto1                     lto_input_tree
18338     5.3106  lto1                     gt_ggc_mx_lang_tree_node
15723     4.5533  lto1                     ggc_set_mark
15109     4.3755  lto1                     inflate_fast
13883     4.0204  lto1                     ht_lookup_with_hash
12957     3.7523  lto1                     pointer_map_insert
12433     3.6005  libc-2.11.1.so           memset
8661      2.5082  lto1                     lto_input_uleb128
8584      2.4859  libc-2.11.1.so           _int_malloc
6832      1.9785  lto1                     ggc_internal_alloc_stat
6722      1.9467  lto1                     ht_lookup

We do have nice improvements in merging and streaming efficiency. Still, burning over 10% in hashing doesn't seem quite reasonable.

I am not sure whether most of the htab overhead is still type merging, given that the rest of it has dropped off the profile.  It may be something stupid, like the file name hash, which is queried every time the file changes in the location. I should probably redo the callgraph profile later next week.

I do have some extra patches that reduce uleb streaming overhead and make lto_input_tree a bit more streamlined, which might help a little. I am not sure how much room is left for simple optimizations in this direction and how much we really need to look into streaming fewer trees.

 garbage collection    :  16.29 ( 6%) usr   0.02 ( 0%) sys  16.33 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  76.15 (28%) usr   2.96 (21%) sys  79.33 (28%) wall  722892 kB (44%) ggc
 ipa lto decl out      :  83.36 (31%) usr   4.58 (32%) sys  88.37 (31%) wall       0 kB ( 0%) ggc
 ipa lto decl merge    :  14.59 ( 5%) usr   0.00 ( 0%) sys  14.64 ( 5%) wall     801 kB ( 0%) ggc
 inline heuristics     :  40.95 (15%) usr   0.19 ( 1%) sys  41.40 (14%) wall  241725 kB (15%) ggc

Memory needed is down, too, at about 4.3GB (in 64bit compilation).

GIMPLE type table: size 1048573, 570402 elements, 5098430 searches, 3158421 collisions (ratio: 0.619489)
GIMPLE type hash table: size 4194301, 1441169 elements, 44401918 searches, 37071081 collisions (ratio: 0.834898)
GIMPLE canonical type table: size 65521, 49079 elements, 896788 searches, 575628 collisions (ratio: 0.641877)
GIMPLE canonical type hash table: size 1048573, 524811 elements, 2845518 searches, 2279153 collisions (ratio: 0.800962)
[WPA] Compression: 424774798 input bytes, 1619588170 uncompressed bytes (ratio: 3.812816)
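The ratios printed in these statistics lines can be reproduced from the raw counts: the hash table ratio matches collisions divided by searches, and the compression ratio matches uncompressed bytes divided by input bytes. This is a small Python check using only numbers copied from the dump above; the helper and variable names are hypothetical.

```python
def ratio(numerator, denominator):
    """Plain quotient, as the statistics lines appear to report it."""
    return numerator / denominator


# (collisions, searches) copied from the statistics lines above.
hash_tables = {
    "GIMPLE type table": (3158421, 5098430),
    "GIMPLE type hash table": (37071081, 44401918),
}
collision_ratios = {
    name: ratio(collisions, searches)
    for name, (collisions, searches) in hash_tables.items()
}

# [WPA] Compression line: uncompressed bytes / input bytes.
compression_ratio = ratio(1619588170, 424774798)

for name, value in collision_ratios.items():
    print(f"{name}: {value:.6f} collisions per search")
print(f"compression: {compression_ratio:.6f}")
```

Read this way, a ratio near 0.8 for the type hash table means most searches still hit at least one collision on average.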
Comment 97 Jan Hubicka 2011-06-02 13:28:28 UTC
Today I noticed by accident that the following hack:
Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c  (revision 174547)
+++ lto-streamer-out.c  (working copy)
@@ -1135,15 +1288,15 @@

   lto_output_tree_or_ref (ob, BINFO_OFFSET (expr), ref_p);
   lto_output_tree_or_ref (ob, BINFO_VTABLE (expr), ref_p);
-  lto_output_tree_or_ref (ob, BINFO_VIRTUALS (expr), ref_p);
+  /*lto_output_tree_or_ref (ob, BINFO_VIRTUALS (expr), ref_p);*/
   lto_output_tree_or_ref (ob, BINFO_VPTR_FIELD (expr), ref_p);

   output_uleb128 (ob, VEC_length (tree, BINFO_BASE_ACCESSES (expr)));
   FOR_EACH_VEC_ELT (tree, BINFO_BASE_ACCESSES (expr), i, t)
     lto_output_tree_or_ref (ob, t, ref_p);

-  lto_output_tree_or_ref (ob, BINFO_INHERITANCE_CHAIN (expr), ref_p);
-  lto_output_tree_or_ref (ob, BINFO_SUBVTT_INDEX (expr), ref_p);
+  /* Backend do not care about BINFO_INHERITANCE_CHAIN and BINFO_SUBVTT_INDEX.
+   */
   lto_output_tree_or_ref (ob, BINFO_VPTR_INDEX (expr), ref_p);
 }

@@ -2014,7 +2167,7 @@
     lto_output_tree_ref (ob, t);

   /* Output the head of the arguments list.  */
-  lto_output_tree_ref (ob, DECL_ARGUMENTS (function));
+  lto_output_chain (ob, DECL_ARGUMENTS (function), true);

   /* Output all the SSA names used in the function.  */
   output_ssa_names (ob, fn);
Index: lto-streamer-in.c
===================================================================
--- lto-streamer-in.c   (revision 174547)
+++ lto-streamer-in.c   (working copy)
@@ -2308,7 +2438,7 @@
   while (t);

   BINFO_OFFSET (expr) = lto_input_tree (ib, data_in);
-  BINFO_VTABLE (expr) = lto_input_tree (ib, data_in);
+  /*BINFO_VTABLE (expr) = lto_input_tree (ib, data_in);*/
   BINFO_VIRTUALS (expr) = lto_input_tree (ib, data_in);
   BINFO_VPTR_FIELD (expr) = lto_input_tree (ib, data_in);

@@ -2323,8 +2453,6 @@
        }
     }

-  BINFO_INHERITANCE_CHAIN (expr) = lto_input_tree (ib, data_in);
-  BINFO_SUBVTT_INDEX (expr) = lto_input_tree (ib, data_in);
   BINFO_VPTR_INDEX (expr) = lto_input_tree (ib, data_in);
 }


reduces memory usage from 4.4GB to 2.7GB, cutting it by almost 40%, and proportionally improves compilation speed.  The side effect is that type-based devirtualization is disabled.

The difference is the amount of IL streamed.  W/o hack:
> [WPA] Compression: 430817772 input bytes, 2004640654 uncompressed bytes (ratio: 4.653106)
> [WPA] Size of mmap'd section decls: 267817970 bytes
> [WPA] Size of mmap'd section function_body: 144808174 bytes
>  ipa lto decl in       :  74.90 (30%) usr   2.38 (19%) sys  77.51 (29%) wall  722892 kB (44%) ggc

(the ggc memory info wraps around at the 4GB limit; I have a patch for that)

With hack:
> [WPA] Compression: 308616744 input bytes, 1236371760 uncompressed bytes (ratio: 4.006172)
> [WPA] Size of mmap'd section decls: 147396203 bytes
> [WPA] Size of mmap'd section function_body: 144662716 bytes
>  ipa lto decl in       :  38.85 (23%) usr   1.18 (12%) sys  40.12 (23%) wall 2674626 kB (75%) ggc
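To make the before/after comparison easier to read, the relative savings can be computed directly from the two dumps. The sketch below uses only the numbers quoted above; the helper and dictionary names are hypothetical.

```python
def reduction(before, after):
    """Fraction by which a quantity shrank going from before to after."""
    return 1.0 - after / before


# (without hack, with hack), copied from the two [WPA] dumps above.
measurements = {
    "uncompressed bytes": (2004640654, 1236371760),
    "decls section bytes": (267817970, 147396203),
    "function_body section bytes": (144808174, 144662716),
    "decl in usr seconds": (74.90, 38.85),
}
reductions = {
    name: reduction(before, after)
    for name, (before, after) in measurements.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} smaller")
```

The function_body section is essentially unchanged, so nearly all of the savings come from the decls section, which is consistent with the hack only dropping BINFO-related trees.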

The node stats with the patch are as follows:
identifier_node       505095
tree_list            1809449
integer_type          175310
pointer_type         1198885
reference_type         65356
array_type             96153
record_type           729335
union_type             14171
function_type         120632
method_type           504881
integer_cst           587216
string_cst            204367
function_decl         909919
label_decl            261908
field_decl           1278114
var_decl               87787
const_decl            327835
parm_decl            1653719
type_decl             771617
result_decl           559971
debug_expr_decl       147434
constructor           162322
nop_expr              531950
addr_expr             920865
tree_binfo           1013612

(to be compared with my previous stats)

Heap vector stats:

ipa-prop.c:2053 (ipa_node_duplication_hook)          540408: 0.8%    1046048           21339: 0.2%
ipa-inline-analysis.c:2008 (inline_merge_summary    1697908: 2.5%    3086804           99582: 1.1%
ipa-reference.c:185 (set_reference_optimization_    6122784: 9.0%   10353528              10: 0.0%
lto-cgraph.c:113 (lto_cgraph_encoder_encode)        6485840: 9.5%   10924352           22118: 0.2%
ipa-ref.c:59 (ipa_record_reference)                16005792:23.5%   20789048          534854: 6.0%
ipa-inline-analysis.c:647 (inline_summary_alloc)   17904344:26.3%   35257432           11486: 0.1%
passes.c:1893 (execute_one_pass)                   18076256:26.5%   20971480          474948: 5.3%
Total                                              68129708                           8892582

GGC stats:
ipa-inline-analysis.c:841 (inline_node_duplicati          0: 0.0%      42428: 0.0%   37876224: 2.3%    2058852: 0.6%     232982
gimple.c:4177 (iterative_hash_gimple_type)         43510016: 2.8%          0: 0.0%          0: 0.0%          0: 0.0%    2719376
lto-symtab.c:156 (lto_symtab_register_decl)        50215704: 3.3%          0: 0.0%          0: 0.0%          0: 0.0%     896709
lto-section-in.c:471 (lto_new_in_decl_state)         165360: 0.0%          0: 0.0%   51424080: 3.2%          0: 0.0%     429912
cgraph.c:1008 (cgraph_create_edge_1)                      0: 0.0%          0: 0.0%   77585352: 4.8%          0: 0.0%     746013
lto-streamer-in.c:2477 (lto_input_ts_constructor   34780240: 2.3%   67555760: 8.4%   45650928: 2.8%   33677352:10.4%     271362
ipa-inline-analysis.c:643 (inline_summary_alloc)          0: 0.0%          0: 0.0%   85235448: 5.3%   18126584: 5.6%          1
ipa-ref.c:54 (ipa_record_reference)                       0: 0.0%  171658064:21.4%   85633072: 5.3%   68326696:21.0%     554106
lto-streamer-in.c:1934 (lto_materialize_tree)      90241344: 5.9%          0: 0.0%   11233544: 0.7%       5872: 0.0%    1013612
lto/lto.c:217 (lto_read_in_decl_state)               333288: 0.0%          0: 0.0%  130600080: 8.1%   24601136: 7.6%    3009384
toplev.c:1027 (realloc_for_line_map)                      0: 0.0%  167815168:20.9%  167778304:10.4%   67182592:20.7%         14
tree.c:1223 (build_int_cst_wide)                  200129008:13.0%          0: 0.0%    2046496: 0.1%   66567480:20.5%      40217
cgraph.c:457 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  226542712:14.0%          0: 0.0%     765347
lto-streamer-in.c:1939 (lto_materialize_tree)    1077917488:70.1%          0: 0.0%  540532272:33.4%   28671712: 8.8%   12277142
Total                                            1537795379        803354140       1619917572        325016043         27622283
source location                                     Garbage            Freed             Leak         Overhead            Times

Honza
Comment 98 Jan Hubicka 2011-06-02 14:28:47 UTC
Martin suggested ignoring BINFOs without FLAG_2 set.
It doesn't seem to make much difference:

[WPA] Compression: 430287537 input bytes, 1997250286 uncompressed bytes (ratio: 4.641664)
[WPA] Size of mmap'd section decls: 267483492 bytes
 ipa lto decl in       :  73.75 (29%) usr   2.37 (17%) sys  76.27 (28%) wall  745752 kB (45%) ggc
Comment 99 Markus Trippelsdorf 2011-06-15 10:32:08 UTC
New build failure with "gold" and gcc 4.7.0 20110615:

make[6]: Entering directory `/var/tmp/mozilla-central/moz-build-dir/js/src/shell'
js.cpp
c++ -o js.o -c  -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux3.0\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/nsprpub  -I/var/tmp/mozilla-central/moz-build-dir/dist/include/nspr    -fPIC  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -fprofile-generate -O3   -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/js.pp /var/tmp/mozilla-central/js/src/shell/js.cpp
jsworkers.cpp
c++ -o jsworkers.o -c  -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux3.0\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/nsprpub  -I/var/tmp/mozilla-central/moz-build-dir/dist/include/nspr    -fPIC  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -fprofile-generate -O3   -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/jsworkers.pp /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘void js::workers::MainQueue::destroy(JSContext*)’:
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:371:16: warning: deleting object of polymorphic class type ‘js::workers::MainQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor]
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘bool js::workers::ThreadPool::start(JSContext*)’:
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:511:20: warning: deleting object of polymorphic class type ‘js::workers::WorkerQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor]
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘void js::workers::ThreadPool::shutdown(JSContext*)’:
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:548:16: warning: deleting object of polymorphic class type ‘js::workers::WorkerQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor]
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In static member function ‘static void js::workers::Worker::jsFinalize(JSContext*, JSObject*)’:
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:690:20: warning: deleting object of polymorphic class type ‘js::workers::Worker’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor]
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In static member function ‘static js::workers::Worker* js::workers::Worker::create(JSContext*, js::workers::WorkerParent*, JSString*, JSObject*)’:
/var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:1073:16: warning: deleting object of polymorphic class type ‘js::workers::Worker’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor]
In file included from /var/tmp/mozilla-central/js/src/shell/js.cpp:97:0:
/var/tmp/mozilla-central/js/src/jsobjinlines.h: In member function ‘void JSObject::setArrayLength(uint32)’:
/var/tmp/mozilla-central/js/src/jsobjinlines.h:367:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
/usr/bin/python2.7 /var/tmp/mozilla-central/js/src/config/pythonpath.py -I../config /var/tmp/mozilla-central/js/src/config/expandlibs_exec.py --uselist --  c++ -o js  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe  -DNDEBUG -DTRIMMED -fprofile-generate -O3  js.o jsworkers.o   -lpthread   -fprofile-generate -Wl,-rpath-link,/bin -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/lib  -L../../../dist/bin -L../../../dist/lib -L/var/tmp/mozilla-central/moz-build-dir/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl
/var/tmp/mozilla-central/moz-build-dir/js/src/shell/jsworkers.o:jsworkers.cpp:function js::workers::Worker::processOneEvent(): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoRequest::~JSAutoRequest()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/js/src/shell/jsworkers.o:jsworkers.cpp:function js::workers::ThreadPool::start(JSContext*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoSuspendRequest::JSAutoSuspendRequest(JSContext*)' is not defined locally
../libjs_static.a(jsapi.o):jsapi.cpp:function StopRequest(JSContext*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::AutoLockGC::~AutoLockGC()' is not defined locally
../libjs_static.a(jsapi.o):jsapi.cpp:function JS_ConvertArgumentsVA: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
../libjs_static.a(jsapi.o):jsapi.cpp:function JS_New: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
../libjs_static.a(jsarray.o):jsarray.cpp:function array_toSource(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::StringBuffer::StringBuffer(JSContext*)' is not defined locally
../libjs_static.a(jsarray.o):jsarray.cpp:function array_toString_sub(JSContext*, JSObject*, int, JSString*, js::Value*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::StringBuffer::StringBuffer(JSContext*)' is not defined locally
../libjs_static.a(jsemit.o):jsemit.cpp:function BindNameToSlot(JSContext*, JSCodeGenerator*, JSParseNode*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
../libjs_static.a(jsemit.o):jsemit.cpp:function BindNameToSlot(JSContext*, JSCodeGenerator*, JSParseNode*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
../libjs_static.a(jsfun.o):jsfun.cpp:function Function(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
../libjs_static.a(jsfun.o):jsfun.cpp:function Function(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally
...

Using the bfd linker instead of "gold" seems to work.
gcc-4.6.1 also works fine.
Comment 100 Markus Trippelsdorf 2011-06-15 10:44:52 UTC
Please note that this error only happens during a profiled build.
Normal build seems to be OK.
Comment 101 Mike Hommey 2011-06-15 11:38:01 UTC
(In reply to comment #100)
> Please note that this error only happens during a profiled build.
> Normal build seems to be OK.

FWIW: https://bugzilla.mozilla.org/show_bug.cgi?id=664387
Comment 102 Markus Trippelsdorf 2011-06-15 11:44:22 UTC
Jan,
this is caused by:
commit 8c1fce46fc02e43e82b05f49894690133a1bcdcf
Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jun 10 20:06:48 2011 +0000

Reverting the commit "fixes" the problem.
Comment 103 Markus Trippelsdorf 2011-06-15 12:34:20 UTC
Even with 8c1fce46fc0 reverted libxul fails to link during
a profiledbuild. Normal build is fine.

with bfd:
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../../layout/ipc/RenderFrameParent.o: relocation R_X86_64_PC32 against undefined hidden symbol `nsRefPtr<mozilla::layers::ImageContainer>::~nsRefPtr()' can not be used when making a shared object
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: Bad value

with gold:
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../layout/ipc/RenderFrameParent.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../content/events/src/nsEventStateManager.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../content/xul/templates/src/nsRuleNetwork.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../gfx/thebes/GLContextProviderGLX.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvlatin/nsUnicodeToUCS2BE.o:nsUnicodeToUCS2BE.cpp:function vtable for nsUnicodeToUTF16BE: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsUnicodeToUTF16BE::~nsUnicodeToUTF16BE()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvlatin/nsUnicodeToUCS2BE.o:nsUnicodeToUCS2BE.cpp:function vtable for nsUnicodeToUTF16LE: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsUnicodeToUTF16LE::~nsUnicodeToUTF16LE()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvcn/nsGBKToUnicode.o:nsGBKToUnicode.cpp:function vtable for nsGBKToUnicode: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsGBKToUnicode::~nsGBKToUnicode()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvcn/nsGBKToUnicode.o:nsGBKToUnicode.cpp:function vtable for nsGB18030ToUnicode: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsGB18030ToUnicode::~nsGB18030ToUnicode()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../parser/htmlparser/src/nsHTMLTokens.o:nsHTMLTokens.cpp:function vtable for CAttributeToken: warning: relocation refers to discarded section
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'CAttributeToken::~CAttributeToken()' is not defined locally
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../layout/generic/nsGfxScrollFrame.o:nsGfxScrollFrame.cpp:function vtable for nsHTMLScrollFrame: warning: relocation refers to discarded section
...
Comment 104 Jan Hubicka 2011-06-18 08:53:37 UTC
> Even with 8c1fce46fc0 reverted libxul fails to link during
> a profiledbuild. Normal build is fine.

I haven't really tested profiledbuild for a while, so I will check.
Last time I tried we were able to build libxul but still had problems
building one of the later libraries because of the COMDAT issues. I filed a
Mozilla PR for that (the problem really is including some classes but not
linking with their implementation).

What worked well for me is to profile w/o LTO and use LTO only for the final
build.  This is the recommended way anyway, as an LTO -fprofile-generate build
is unnecessarily expensive.

What is the official way of building mozilla with FDO?

Does the non-FDO problem persist for you?  The Jun 10 commit was part of a
longer series of alias rewrite patches and I fixed some of the fallout
afterwards (and was able to build mozilla).  I didn't see the particular
problem you report, however.

Honza
Comment 105 Markus Trippelsdorf 2011-06-18 10:18:09 UTC
(In reply to comment #104)
> > Even with 8c1fce46fc0 reverted libxul fails to link during
> > a profiledbuild. Normal build is fine.
> 
> I haven't really tested profiledbuild for a while, so I will check.
> Last time I tried we were able to build libxul but still had problems
> building one of the later libraries because of the COMDAT issues. I filed a
> Mozilla PR for that (the problem really is including some classes but not
> linking with their implementation).
> 
> What worked well for me is to profile w/o LTO and use LTO only for the final
> build.  This is the recommended way anyway, as an LTO -fprofile-generate build
> is unnecessarily expensive.

Yes, that's how I run things normally, too.

> What is the official way of building mozilla with FDO?

(Here is what I use:)

make -f client.mk profiledbuild

with the following appended to your .mozconfig:
ac_add_options --enable-profile-guided-optimization
mk_add_options PROFILE_GEN_SCRIPT=/home/markus/run-firefox.sh

~ % cat run-firefox.sh
#!/bin/sh
export NO_EM_RESTART=1
sudo -u markus $OBJDIR/dist/bin/firefox -no-remote

This will start the instrumented firefox. Use it for some time. After 
you close it, the final -fprofile-use build starts.

> Does the non-FDO problem persist for you?  The Jun 10 commit was part of a
> longer series of alias rewrite patches and I fixed some of the fallout
> afterwards (and was able to build mozilla).  I didn't see the particular
> problem you report, however.

I only see the problems during a FDO build, non-FDO is fine.
(But because it turned out that both issues have nothing to do with LTO maybe it would be better to file a new bug for them?)
Comment 106 Markus Trippelsdorf 2011-06-26 19:50:20 UTC
I've opened a new bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49533
with a patch that fixes the issue seen in Comment 99.
Comment 107 Jan Hubicka 2011-08-04 19:16:29 UTC
Now my build dies on what appears to be configure confusion:
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:43:17: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:52:26: error: 'read' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:52:26: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:62:39: error: 'write' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:62:39: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:74:7: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:74:7: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:76:7: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:76:7: error: invalid type in declaration before ';' token

While I could definitely get around this by adding the proper #includes, it seems that things simply get misconfigured.

Martin, you mentioned a similar problem earlier; perhaps you already have a solution?
Comment 108 Andrew Pinski 2011-08-04 19:22:50 UTC
(In reply to comment #107)
> Now my build dies on what appears to be configure confusion:
> /abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:43:17:
> error: 'close' was not declared in this scope

Actually I think this was caused by the removal of the #include of unistd.h in gthr-posix.h, which means the version of mozilla you are trying to use has not been updated for that change.
Comment 110 Martin Jambor 2011-08-05 13:47:30 UTC
(In reply to comment #107)
> 
> Martin, you mentioned a similar problem earlier; perhaps you already have
> a solution?

I went for adding the includes.  I wasn't looking into dependencies in
much detail and ended up just adding #include <unistd.h> to:
    - ipc/chromium/src/base/file_util.cc
    - ipc/chromium/src/base/message_pump_libevent.cc
    - ipc/chromium/src/base/file_util_linux.cc
    - toolkit/crashreporter/client/crashreporter_gtk_common.cpp

However, I also suspected some configure problem because I also had to
tweak #if's in ipc/chromium/src/base/time_posix.cc.

The patch that I use to do this is at
http://labs.suse.cz/mjambor/undefined_and_pp_errors.diff

In order to LTO build mozilla I currently need this one, a patch
adding attribute used to various places I got from you and a simple
patch fixing mozilla bug 652563.
Comment 111 Jan Hubicka 2011-09-27 20:48:19 UTC
Mozilla now builds for me with slim LTO objects, i.e. with -flto=24 -fuse-linker-plugin -fno-fat-lto-objects.
One needs ar/nm/ranlib that work with slim LTO. I simply set PATH to a directory with the following scripts:
jh@evans:/abuild/jh/trunk-install/bin> cat nm
#!/bin/sh
/usr/bin/nm --plugin /abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so $*
jh@evans:/abuild/jh/trunk-install/bin> cat ar
#!/bin/sh
cmd=$1
shift
/abuild/jh/trunk-install/bin/ar-with-plugin $cmd --plugin /abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so $*
jh@evans:/abuild/jh/trunk-install/bin> cat ranlib
#!/bin/sh
jh@evans:/abuild/jh/trunk-install/bin> 

I think ranlib exists with plugin support now, too, but I was too lazy to rebuild it; just disabling it was equally easy.
I will do some benchmarks about build time/disk usage.

Resulting binary works too, BTW :)
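
One caveat with wrappers of this shape: an unquoted $* re-splits any argument that contains whitespace. A minimal sketch of a more defensive variant follows; the names LTO_PLUGIN, REAL_NM and nm_with_plugin are hypothetical, and the default paths are simply the ones quoted above. Later GCC releases also ship gcc-ar, gcc-nm and gcc-ranlib wrappers for the same purpose.

```shell
#!/bin/sh
# Sketch of a plugin-aware nm wrapper. LTO_PLUGIN and REAL_NM are
# hypothetical variable names; the defaults are the paths quoted above.
LTO_PLUGIN=${LTO_PLUGIN:-/abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so}
REAL_NM=${REAL_NM:-/usr/bin/nm}

nm_with_plugin() {
    # "$@" forwards every argument as a single word, even if it contains
    # spaces; an unquoted $* would split such an argument again.
    "$REAL_NM" --plugin "$LTO_PLUGIN" "$@"
}

# When installed as a script named nm, forward the script's own arguments:
# nm_with_plugin "$@"
```

The same pattern would apply to the ar wrapper, with the subcommand kept in front of the --plugin option as in the script above.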
Comment 112 Jan Hubicka 2011-09-28 13:33:03 UTC
OK, the problem turns out to be a configure issue.  The configure script greps the asm output, and with slim LTO it does not find what it expects there, so it disables hidden visibilities. No surprise this leads to a performance disaster.  I use the following hack:
diff -r 06b2977afb85 configure.in
--- a/configure.in      Fri Sep 09 23:25:02 2011 -0400
+++ b/configure.in      Wed Sep 28 15:30:56 2011 +0200
@@ -3035,7 +3035,7 @@
                   int foo __attribute__ ((visibility ("hidden"))) = 1;
 EOF
                   ac_cv_visibility_hidden=no
-                  if ${CC-cc} -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+                  if ${CC-cc} -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
                     if egrep '\.(hidden|private_extern).*foo' conftest.s >/dev/null; then
                       ac_cv_visibility_hidden=yes
                     fi
@@ -3051,7 +3051,7 @@
                     int foo __attribute__ ((visibility ("default"))) = 1;
 EOF
                     ac_cv_visibility_default=no
-                    if ${CC-cc} -fvisibility=hidden -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+                    if ${CC-cc} -fvisibility=hidden -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
                       if ! egrep '\.(hidden|private_extern).*foo' conftest.s >/dev/null; then
                         ac_cv_visibility_default=yes
                       fi
@@ -3070,7 +3070,7 @@
                       int foo_default = 1;
 EOF
                       ac_cv_visibility_pragma=no
-                      if ${CC-cc} -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+                      if ${CC-cc} -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
                         if egrep '\.(hidden|private_extern).*foo_hidden' conftest.s >/dev/null; then
                           if ! egrep '\.(hidden|private_extern).*foo_default' conftest.s > /dev/null; then
                             ac_cv_visibility_pragma=yes
@@ -3092,7 +3092,7 @@
 }
 EOF
                        ac_cv_have_visibility_class_bug=no
-                       if ! ${CXX-g++} ${CXXFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -S -o conftest.S conftest.c > /dev/null 2>&1 ; then
+                       if ! ${CXX-g++} ${CXXFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -S -fno-lto -o conftest.S conftest.c > /dev/null 2>&1 ; then
                          ac_cv_have_visibility_class_bug=yes
                        else
                          if test `egrep -c '@PLT|\\$stub' conftest.S` = 0; then
@@ -3116,7 +3116,7 @@
 }
 EOF
                        ac_cv_have_visibility_builtin_bug=no
-                       if ! ${CC-cc} ${CFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -O2 -S -o conftest.S conftest.c > /dev/null 2>&1 ; then
+                       if ! ${CC-cc} ${CFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -O2 -S -fno-lto -o conftest.S conftest.c > /dev/null 2>&1 ; then
                          ac_cv_have_visibility_builtin_bug=yes
                        else
                          if test `grep -c "@PLT" conftest.S` = 0; then
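
To illustrate why the grep-based checks misfire, here is a small sketch that applies the same egrep pattern to two hand-written assembler snippets. The helper name has_hidden_directive and both snippets are assumptions made for illustration, not real compiler output: the "fat" one contains a .hidden directive, the "slim" one only an LTO bytecode section.

```shell
#!/bin/sh
# Mirrors the egrep test from configure.in; has_hidden_directive is a
# hypothetical helper name.
has_hidden_directive() {
    # $1: assembler file, $2: symbol name
    grep -E "\.(hidden|private_extern).*$2" "$1" >/dev/null
}

# Illustrative assembler snippets, not real compiler output.
tmpdir=$(mktemp -d)
printf '\t.globl\tfoo\n\t.hidden\tfoo\nfoo:\n\t.long\t1\n' > "$tmpdir/fat.s"
printf '\t.section\t.gnu.lto_.decls,"",@progbits\n\t.string\t"x"\n' > "$tmpdir/slim.s"

for f in fat slim; do
    if has_hidden_directive "$tmpdir/$f.s" foo; then
        echo "$f: visibility detected"
    else
        echo "$f: visibility NOT detected"
    fi
done
rm -rf "$tmpdir"
```

Passing -fno-lto to the test compile, as the hack does, makes the compiler emit ordinary assembler again so the pattern has something to match.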
Comment 113 Jan Hubicka 2011-09-29 16:24:56 UTC
Even with PR47247 solved, -fprofile-generate -flto build fails at

libbrowsercomps.so.ltrans23.ltrans.o:libbrowsercomps.so.ltrans23.o:function _ZTV17gfxUnknownSurface.local.706.2371: error: undefined reference to '_ZN11gfxASurface13BeginPrintingERK9nsAStringS2_'

-fprofile-generate -flto is stupid, since one can profile w/o LTO and get a much faster build. (We also need 15GB for the libxul link.) Still, it seems that we miss some optimization we ought not to miss.
Comment 114 Jan Hubicka 2011-10-01 13:18:30 UTC
So quick summary
 1) -g build is still blocked by dwarf2out ICE
 2) build with gold works, but only without -fprofile-generate. FDO build is also possible, but -fprofile-generate needs -fno-lto (that makes a lot of sense, but we still should fix the bug on the GCC side)
 3) With GNU LD, there is still a bug that blocks Mozilla LTO
    http://sourceware.org/bugzilla/show_bug.cgi?id=13244
 4) Slim LTO works well. Build times are about the same as for non-LTO. One needs the aforementioned configure hacks and ar/nm/ranlib wrappers.

Honza
Comment 115 Jan Hubicka 2011-10-01 15:28:46 UTC
OK, the same errors also happen with the GNU LD build
http://sourceware.org/bugzilla/show_bug.cgi?id=13244
https://bugzilla.mozilla.org/show_bug.cgi?id=691053

I will analyze what happens with -fprofile-generate and gold, but I bet it all fails because we now take the address of the constructor and consequently the constructor is exported out of libxul, but visibilities are wrong.

Honza
Comment 116 Jan Hubicka 2011-10-01 15:52:51 UTC
Solving http://sourceware.org/bugzilla/show_bug.cgi?id=13245
should make that linker error with -flto -fprofile-generate go away.
Comment 117 Markus Trippelsdorf 2011-10-11 07:39:43 UTC
"-flto=4 -fno-fat-lto-objects -fprofile-use -fprofile-correction" breaks 
at js/src/xpconnect/src/dombindings.cpp:

...
In file included from /var/tmp/mozilla-central/js/src/xpconnect/src/dombindings.cpp:1109:0:
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_Add(JSContext*, unsigned int, JS::Value*)’:
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL25HTMLOptionsCollection_AddEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL25HTMLOptionsCollection_AddEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_SetSelectedIndex(JSContext*, JSObject*, long, int, JS::Value*)’:
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_SetSelectedIndexEP9JSContextP8JSObjectliPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_SetSelectedIndexEP9JSContextP8JSObjectliPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_GetSelectedIndex(JSContext*, JSObject*, long, JS::Value*)’:
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_GetSelectedIndexEP9JSContextP8JSObjectlPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_GetSelectedIndexEP9JSContextP8JSObjectlPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_Item(JSContext*, unsigned int, JS::Value*)’:
./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN7mozilla3dom7bindingL26HTMLOptionsCollection_ItemEP9JSContextjPN2JS5ValueE’ found [enabled by default]
./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN7mozilla3dom7bindingL26HTMLOptionsCollection_ItemEP9JSContextjPN2JS5ValueE’ found [enabled by default]
./dombindings_gen.cpp: In member function ‘nsCOMPtr<nsIDOMNode>::~nsCOMPtr()’:
./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN8nsCOMPtrI10nsIDOMNodeED2Ev’ found [enabled by default]
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_NamedItem(JSContext*, unsigned int, JS::Value*)’:
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL31HTMLOptionsCollection_NamedItemEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL31HTMLOptionsCollection_NamedItemEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch]
./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot
cc1plus: some warnings being treated as errors
Comment 118 Markus Trippelsdorf 2011-10-11 12:18:21 UTC
Probably a Mozilla bug. See:
https://bugzilla.mozilla.org/show_bug.cgi?id=693563
Comment 119 Jan Hubicka 2011-10-19 09:22:01 UTC
Some up-to-date performance data.  WPA peaks at 3.1GB in top now (3261 virt). Overall compile time is 4m32s real, 21m14 user.
GGC memory is GC 2248537k -> 1727826k

WPA time report:
 callgraph optimization  :   1.68 ( 1%) usr   0.00 ( 0%) sys   1.70 ( 1%) wall   16008 kB (11%) ggc
 varpool construction    :   0.66 ( 0%) usr   0.02 ( 0%) sys   0.68 ( 0%) wall   55300 kB (39%) ggc
 ipa cp                  :   1.70 ( 1%) usr   0.09 ( 1%) sys   1.79 ( 1%) wall   75845 kB (53%) ggc
 ipa lto gimple out      :   9.40 ( 6%) usr   0.91 (10%) sys  10.36 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  45.99 (29%) usr   1.66 (19%) sys  47.95 (28%) wall 3285797 kB (2315%) ggc
 ipa lto decl out        :  35.61 (22%) usr   1.65 (19%) sys  37.23 (22%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   3.73 ( 2%) usr   0.22 ( 2%) sys   3.95 ( 2%) wall  621046 kB (438%) ggc
 ipa lto decl merge      :   5.75 ( 4%) usr   0.00 ( 0%) sys   5.75 ( 3%) wall     803 kB ( 1%) ggc
 ipa lto cgraph merge    :   2.79 ( 2%) usr   0.02 ( 0%) sys   2.81 ( 2%) wall   27731 kB (20%) ggc
 inline heuristics       :  31.32 (19%) usr   0.13 ( 1%) sys  31.48 (18%) wall  252282 kB (178%) ggc
 TOTAL                 : 161.21             8.82           170.40             141952 kB

(i.e. WPA is about 60% of overall compilation time; within WPA, about 1/3 is streaming in, 1/3 streaming out, and 1/5 the inliner).
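
These shares can be rechecked from the wall-clock column of the report. The sketch below assumes that "ipa lto decl in" and "ipa lto decl out" are the streaming phases meant, and that the 4m32s real time (272 s) is the overall figure; share is a hypothetical helper name.

```shell
#!/bin/sh
# share PART TOTAL: print PART as a rounded percentage of TOTAL.
# (share is a hypothetical helper; the numbers come from the report above.)
share() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * part / total }'
}

share 170.40 272     # WPA wall time vs. 4m32s (272 s) overall build
share 47.95 170.40   # ipa lto decl in (streaming in)
share 37.23 170.40   # ipa lto decl out (streaming out)
share 31.48 170.40   # inline heuristics
```

The results agree with the wall-clock percentages printed in the report itself for the three WPA phases.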

oprofile of streaming in:
9467      6.8109  lto1                     htab_find_slot_with_hash
9036      6.5008  lto1                     inflate_fast
6608      4.7540  libc-2.11.1.so           memset
6256      4.5008  libc-2.11.1.so           _int_malloc
6243      4.4914  lto1                     pointer_map_insert
5694      4.0965  lto1                     lto_input_tree
5014      3.6072  lto1                     gt_ggc_mx_lang_tree_node
4522      3.2533  lto1                     streamer_read_tree_bitfields
4463      3.2108  lto1                     ggc_set_mark
4087      2.9403  opreport                 /usr/bin/opreport
3661      2.6339  lto1                     ggc_internal_alloc_stat
3475      2.5000  lto1                     streamer_read_uhwi
2508      1.8043  lto1                     gimple_type_eq
2418      1.7396  lto1                     streamer_read_tree_body
2310      1.6619  libc-2.11.1.so           memcpy
2292      1.6489  lto1                     streamer_tree_cache_insert_1
2255      1.6223  libc-2.11.1.so           memcmp
2119      1.5245  lto1                     ht_lookup_with_hash
1902      1.3684  lto1                     iterative_hash_hashval_t
1885      1.3561  lto1                     lto_fixup_types
1884      1.3554  libc-2.11.1.so           _int_free
1872      1.3468  lto1                     uniquify_nodes
1842      1.3252  lto1                     htab_expand
1825      1.3130  oprofiled                /usr/bin/oprofiled
1813      1.3043  lto1                     adler32
1734      1.2475  lto1                     htab_hash_string
1509      1.0856  libc-2.11.1.so           _IO_vfscanf
1470      1.0576  libc-2.11.1.so           malloc_consolidate

The pointer map and htab entries are mostly type merging still, I believe.

oprofile of inliner:
8772     37.9215  lto1                     edge_badness
5532     23.9149  lto1                     do_estimate_growth_1
1647      7.1200  lto1                     update_caller_keys
1484      6.4154  lto1                     can_inline_edge_p
744       3.2163  lto1                     estimate_calls_size_and_time.isra.32
509       2.2004  lto1                     estimate_edge_size_and_time.constprop.65
495       2.1399  lto1                     fibheap_consolidate
267       1.1542  lto1                     fibheap_extr_min_node
210       0.9078  lto1                     cgraph_maybe_hot_edge_p

I.e. easy to handle by taming down the amount of heap updating.

Stream out:
33711    19.7166  lto1                     lto1                     varpool_node_for_asm
13947     8.1572  lto1                     lto1                     decl_assembler_name_equal
8873      5.1896  lto1                     lto1                     pointer_map_insert
8765      5.1264  lto1                     lto1                     linemap_lookup
6809      3.9824  lto1                     lto1                     lto_output_tree
4931      2.8840  lto1                     lto1                     inflate_fast
4718      2.7594  lto1                     lto1                     streamer_write_uhwi_stream
3521      2.0593  lto1                     lto1                     streamer_tree_cache_insert_1
3340      1.9535  lto1                     lto1                     splay_tree_splay
3293      1.9260  lto1                     lto1                     streamer_pack_tree_bitfields
3210      1.8774  libc-2.11.1.so           libc-2.11.1.so           memcpy
3175      1.8570  libc-2.11.1.so           libc-2.11.1.so           _int_malloc

The assembler name lookups will go away with finishing the alias rewrite.

Oprofile of ltrans stage:
52827     3.3333  lto1                     lto1                     value_member
45691     2.8830  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
42528     2.6835  lto1                     lto1                     bitmap_set_bit
41934     2.6460  oprofiled                oprofiled                /usr/bin/oprofiled
22353     1.4104  libc-2.11.1.so           libc-2.11.1.so           memset
21573     1.3612  lto1                     lto1                     htab_find_slot_with_hash
20936     1.3210  lto1                     lto1                     ggc_internal_alloc_stat
19608     1.2372  lto1                     lto1                     record_reg_classes.constprop.10
17423     1.0994  lto1                     lto1                     bitmap_bit_p
17195     1.0850  lto1                     lto1                     for_each_rtx_1
13504     0.8521  libc-2.11.1.so           libc-2.11.1.so           _int_free
12343     0.7788  lto1                     lto1                     bitmap_clear_bit
11826     0.7462  lto1                     lto1                     constrain_operands


The slowest of ltrans is:
 garbage collection      :   1.69 ( 2%) usr   0.01 ( 0%) sys   1.72 ( 2%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   1.52 ( 2%) usr   0.45 ( 9%) sys   1.94 ( 2%) wall  212002 kB (11%) ggc
 ipa lto decl in         :   1.61 ( 2%) usr   0.19 ( 4%) sys   1.81 ( 2%) wall  147115 kB ( 7%) ggc
 cfg cleanup             :   1.46 ( 2%) usr   0.03 ( 1%) sys   1.60 ( 2%) wall    5376 kB ( 0%) ggc
 df live regs            :   2.26 ( 3%) usr   0.03 ( 1%) sys   2.62 ( 3%) wall       0 kB ( 0%) ggc
 tree VRP                :   2.04 ( 2%) usr   0.05 ( 1%) sys   2.34 ( 2%) wall  126142 kB ( 6%) ggc
 tree PTA                :   1.97 ( 2%) usr   0.00 ( 0%) sys   2.43 ( 3%) wall    8733 kB ( 0%) ggc
 tree PRE                :   2.98 ( 3%) usr   0.07 ( 1%) sys   3.83 ( 4%) wall   64875 kB ( 3%) ggc
 tree FRE                :   1.50 ( 2%) usr   0.01 ( 0%) sys   1.98 ( 2%) wall   33609 kB ( 2%) ggc
 expand                  :   4.11 ( 5%) usr   0.11 ( 2%) sys   4.85 ( 5%) wall  138280 kB ( 7%) ggc
 CSE                     :   1.88 ( 2%) usr   0.04 ( 1%) sys   2.16 ( 2%) wall    2764 kB ( 0%) ggc
 CPROP                   :   1.83 ( 2%) usr   0.04 ( 1%) sys   1.87 ( 2%) wall   21657 kB ( 1%) ggc
 integrated RA           :   6.84 ( 8%) usr   0.08 ( 2%) sys   7.30 ( 8%) wall  367479 kB (19%) ggc
 reload                  :   2.47 ( 3%) usr   0.04 ( 1%) sys   2.82 ( 3%) wall    8783 kB ( 0%) ggc
 reload CSE regs         :   2.03 ( 2%) usr   0.01 ( 0%) sys   2.02 ( 2%) wall   19115 kB ( 1%) ggc
 scheduling 2            :   3.08 ( 3%) usr   0.03 ( 1%) sys   3.14 ( 3%) wall    3942 kB ( 0%) ggc
 final                   :  11.46 (13%) usr   1.06 (21%) sys   3.62 ( 4%) wall   40822 kB ( 2%) ggc
 rest of compilation     :   2.97 ( 3%) usr   0.87 (17%) sys   5.22 ( 5%) wall   60101 kB ( 3%) ggc
 unaccounted todo        :   1.35 ( 2%) usr   0.67 (13%) sys   2.37 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                 :  89.65             5.08            95.59            1962376 kB

Final is surprisingly slow.
Comment 120 Jan Hubicka 2011-10-19 13:05:25 UTC
weakref reorg saves about 15 seconds, so we have total WPA time 145s and decl out at 19s (13%).

Honza
Comment 121 Jan Hubicka 2012-05-10 21:45:10 UTC
With the inliner performance fix I am going to push out today, the situation looks as follows:
Execution times (seconds)
 phase parsing           : 606.20 (98%) usr  21.98 (99%) sys 641.28 (98%) wall 2164274 kB (100%) ggc
 phase cgraph            : 337.00 (55%) usr  18.52 (83%) sys 367.32 (56%) wall   88841 kB ( 4%) ggc
 phase finalize          :  10.21 ( 2%) usr   0.28 ( 1%) sys  10.50 ( 2%) wall       0 kB ( 0%) ggc
 garbage collection      :  33.12 ( 5%) usr   0.04 ( 0%) sys  33.21 ( 5%) wall       0 kB ( 0%) ggc
 ipa cp                  :   3.52 ( 1%) usr   0.15 ( 1%) sys   3.67 ( 1%) wall   93737 kB ( 4%) ggc
 ipa lto gimple out      :  14.43 ( 2%) usr   1.38 ( 6%) sys  15.89 ( 2%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 221.85 (36%) usr   2.52 (11%) sys 225.61 (35%) wall 1153296 kB (53%) ggc
 ipa lto decl out        : 179.65 (29%) usr   8.60 (39%) sys 198.90 (31%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   4.59 ( 1%) usr   0.50 ( 2%) sys   5.09 ( 1%) wall  550051 kB (25%) ggc
 ipa lto decl merge      :   9.57 ( 2%) usr   0.00 ( 0%) sys   9.58 ( 1%) wall     291 kB ( 0%) ggc
 ipa lto cgraph merge    :   6.06 ( 1%) usr   0.00 ( 0%) sys   6.08 ( 1%) wall   14158 kB ( 1%) ggc
 whopr wpa               :   6.44 ( 1%) usr   0.06 ( 0%) sys   6.54 ( 1%) wall       2 kB ( 0%) ggc
 whopr wpa I/O           :   2.77 ( 0%) usr   8.03 (36%) sys  11.56 ( 2%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.16 ( 1%) usr   0.08 ( 0%) sys   5.25 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.55 ( 0%) usr   0.00 ( 0%) sys   0.55 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   5.59 ( 1%) usr   0.02 ( 0%) sys   5.61 ( 1%) wall       0 kB ( 0%) ggc
 parser (global)         :   3.98 ( 1%) usr   0.04 ( 0%) sys   4.04 ( 1%) wall       0 kB ( 0%) ggc
 inline heuristics       :  94.38 (15%) usr   0.31 ( 1%) sys  94.90 (15%) wall  342900 kB (16%) ggc
 tree CFG cleanup        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier      :  18.53 ( 3%) usr   0.08 ( 0%) sys  18.61 ( 3%) wall       0 kB ( 0%) ggc
 varconst                :   0.04 ( 0%) usr   0.03 ( 0%) sys   0.14 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   4.70 ( 1%) usr   0.10 ( 0%) sys   4.81 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 : 616.43            22.26           651.79            2165706 kB

So memory use is somewhat up (4GB compared to 3.2GB) but Mozilla grew a bit, too, so I think there are no important changes since my last report.

Performance-wise we are in better shape than the 4.7 release (I will backport the fix; 4.7 needs over 10 minutes in the inliner), but we are still way too slow, with over 3 minutes needed for streaming in.
Comment 122 Jan Hubicka 2012-05-10 21:53:54 UTC
oprofile shows:
139188   15.6963  lto1                     lto1                     uniquify_nodes
66390     7.4868  lto1                     lto1                     estimate_edge_growth
52815     5.9560  lto1                     lto1                     VEC_edge_growth_cache_entry_base_length
47137     5.3157  lto1                     lto1                     iterative_hash_hashval_t
34037     3.8384  lto1                     lto1                     htab_find_slot_with_hash
33604     3.7895  lto1                     lto1                     bp_unpack_value
26584     2.9979  lto1                     lto1                     do_estimate_growth_1
21410     2.4144  lto1                     lto1                     ggc_set_mark
17124     1.9311  lto1                     lto1                     inflate_fast
14464     1.6311  lto1                     lto1                     streamer_read_uhwi
14204     1.6018  lto1                     lto1                     lookup_page_table_entry
11430     1.2890  libc-2.11.1.so           libc-2.11.1.so           memset
11405     1.2861  lto1                     lto1                     streamer_read_hwi_in_range
11286     1.2727  lto1                     lto1                     gt_ggc_mx_lang_tree_node
11017     1.2424  lto1                     lto1                     iterative_hash_gimple_type
10851     1.2237  lto1                     lto1                     pointer_map_insert
10674     1.2037  lto1                     lto1                     lto_input_tree
10536     1.1881  lto1                     lto1                     ht_lookup_with_hash
10269     1.1580  lto1                     lto1                     streamer_read_uchar
9972      1.1245  lto1                     lto1                     streamer_read_uchar
9089      1.0250  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
9086      1.0246  lto1                     lto1                     alloc_page
6603      0.7446  lto1                     lto1                     VEC_edge_growth_cache_entry_base_index

looks like uniquify_nodes got out of control?
Comment 123 Markus Trippelsdorf 2012-05-11 05:55:43 UTC
Just for comparison, clang with -O4 runs only single threaded and does everything in memory (no streaming out). It uses 3.5GB of memory (peak) and takes 19 minutes to finish...
Comment 124 Jan Hubicka 2012-05-11 08:34:17 UTC
> Just for comparison, clang with -O4 runs only single threaded and does
> everything in memory (no streaming out). It uses 3.5GB of memory (peak) and
> takes 19 minutes to finish...

Interesting.  Microsoft's compiler is also barely in 4GB space, right?
Is it with debug info?

I will try a non-WHOPR build to see how bad we are.  The actual IL is about 1.5GB
of the footprint (measuring GGC memory).  I think a good part of the rest comes down
to mmap address space (the object files are rather large).

Honza
Comment 125 Richard Biener 2012-05-11 08:44:51 UTC
(In reply to comment #122)
> oprofile shows:
> 139188   15.6963  lto1                     lto1                     uniquify_nodes
> 66390     7.4868  lto1                     lto1                     estimate_edge_growth
> 52815     5.9560  lto1                     lto1                     VEC_edge_growth_cache_entry_base_length
> 47137     5.3157  lto1                     lto1                     iterative_hash_hashval_t
> 34037     3.8384  lto1                     lto1                     htab_find_slot_with_hash
> 33604     3.7895  lto1                     lto1                     bp_unpack_value
> 26584     2.9979  lto1                     lto1                     do_estimate_growth_1
> 21410     2.4144  lto1                     lto1                     ggc_set_mark
> 17124     1.9311  lto1                     lto1                     inflate_fast
> 14464     1.6311  lto1                     lto1                     streamer_read_uhwi
> 14204     1.6018  lto1                     lto1                     lookup_page_table_entry
> 11430     1.2890  libc-2.11.1.so           libc-2.11.1.so           memset
> 11405     1.2861  lto1                     lto1                     streamer_read_hwi_in_range
> 11286     1.2727  lto1                     lto1                     gt_ggc_mx_lang_tree_node
> 11017     1.2424  lto1                     lto1                     iterative_hash_gimple_type
> 10851     1.2237  lto1                     lto1                     pointer_map_insert
> 10674     1.2037  lto1                     lto1                     lto_input_tree
> 10536     1.1881  lto1                     lto1                     ht_lookup_with_hash
> 10269     1.1580  lto1                     lto1                     streamer_read_uchar
> 9972      1.1245  lto1                     lto1                     streamer_read_uchar
> 9089      1.0250  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
> 9086      1.0246  lto1                     lto1                     alloc_page
> 6603      0.7446  lto1                     lto1                     VEC_edge_growth_cache_entry_base_index
> 
> looks like uniquify_nodes got out of control?

Well - the obvious possibly "slow" part of uniquify nodes is that it walks
all fields of record/union types.  So - do you have a more detailed profile
of uniquify_nodes?
Comment 126 Markus Trippelsdorf 2012-05-11 08:46:39 UTC
(In reply to comment #124)
> > Just for comparison, clang with -O4 runs only single threaded and does
> > everything in memory (no streaming out). It uses 3.5GB of memory (peak) and
> > takes 19 minutes to finish...
> 
> Interesting.  Microsoft's compiler is also barely in 4GB space, right?

IIRC Mozilla recently switched to a 64-bit toolchain on windows, because the
32-bit linker ran out of memory. So they are above 4GB already.

> Is it with debug info?

No.
Comment 127 Mike Hommey 2012-05-11 08:52:24 UTC
(In reply to comment #126)
> (In reply to comment #124)
> > > Just for comparison, clang with -O4 runs only single threaded and does
> > > everything in memory (no streaming out). It uses 3.5GB of memory (peak) and
> > > takes 19 minutes to finish...
> > 
> > Interesting.  Microsoft's compiler is also barely in 4GB space, right?
> 
> IIRC Mozilla recently switched to a 64-bit toolchain on windows, because the
> 32-bit linker ran out of memory. So they are above 4GB already.

There is unfortunately no cross-linker in MSVC, so you can't link 32-bit binaries with a 64-bit toolchain. We're in the process of switching to a 64-bit OS with a 32-bit toolchain, which will allow an extra gigabyte of address space. We've gone past the current 3GB limit a couple of times now, at which point we moved some stuff out of libxul. Before that, we hit the 2GB limit, at which point we used the /3GB option that allows for the extra GB.
Comment 128 Jan Hubicka 2012-05-11 08:52:50 UTC
> Well - the obvious possibly "slow" part of uniquify nodes is that it walks
> all fields of record/union types.  So - do you have a more detailed profile
> of uniquify_nodes?

No, I will try to generate annotated sources then.  I am a bit puzzled by this -
looking at the stuff there seems nothing inherently expensive in it.

Honza
Comment 129 Jan Hubicka 2012-05-11 19:05:19 UTC
OK, the slow part of uniquify_nodes is:
          /* Remove us from our main variant list if we are not the
             variant leader.  */
          if (TYPE_MAIN_VARIANT (t) != t)
            { 
              tem = TYPE_MAIN_VARIANT (t);
              while (tem && TYPE_NEXT_VARIANT (tem) != t)
                tem = TYPE_NEXT_VARIANT (tem);
              if (tem)
                TYPE_NEXT_VARIANT (tem) = TYPE_NEXT_VARIANT (t);
              TYPE_NEXT_VARIANT (t) = NULL_TREE;
            }
Comment 130 Jan Hubicka 2012-05-12 14:44:47 UTC
After fixing one linker error, I can now build Mozilla with -flto-partition=none.  It takes 11GB and 40 minutes, so there is space for improvement ;)

There are some obvious questions, like why IRA needs 63% of GGC memory and VRP 23%.

Also the -flto-partition=none .text section is now 18% smaller.  This is large enough to be declared a bug, but I am not sure how to track it.

Note that my machine has quite poor single-CPU performance, so the compile times are likely not comparable with the LLVM ones reported above (and I also use a debugging build).

 ipa lto gimple in       :  52.12 ( 2%) usr   3.68 ( 9%) sys  55.72 ( 2%) wall 2998249 kB (84%) ggc
 ipa lto decl in         : 225.68 ( 8%) usr   2.39 ( 6%) sys 228.17 ( 8%) wall 1124821 kB (31%) ggc
 ipa lto cgraph I/O      :   4.82 ( 0%) usr   0.44 ( 1%) sys   5.27 ( 0%) wall  684110 kB (19%) ggc
 cfg construction        :   3.01 ( 0%) usr   0.12 ( 0%) sys   3.29 ( 0%) wall   70205 kB ( 2%) ggc
 cfg cleanup             :  46.57 ( 2%) usr   0.41 ( 1%) sys  46.69 ( 2%) wall   75005 kB ( 2%) ggc
 df live regs            :  78.21 ( 3%) usr   0.25 ( 1%) sys  77.55 ( 3%) wall       0 kB ( 0%) ggc
 alias analysis          :  25.59 ( 1%) usr   0.12 ( 0%) sys  25.88 ( 1%) wall  474769 kB (13%) ggc
 parser (global)         :   8.62 ( 0%) usr   0.65 ( 2%) sys  10.00 ( 0%) wall  259389 kB ( 7%) ggc
 inline heuristics       :  87.23 ( 3%) usr   0.51 ( 1%) sys  88.41 ( 3%) wall  451358 kB (13%) ggc
 integration             :  50.61 ( 2%) usr   1.51 ( 4%) sys  52.67 ( 2%) wall 1479979 kB (41%) ggc
 tree CFG cleanup        :  46.68 ( 2%) usr   0.43 ( 1%) sys  48.09 ( 2%) wall   70493 kB ( 2%) ggc
 tree VRP                :  65.88 ( 2%) usr   0.73 ( 2%) sys  66.71 ( 2%) wall  862879 kB (24%) ggc
 tree copy propagation   :  22.30 ( 1%) usr   0.17 ( 0%) sys  22.11 ( 1%) wall  144298 kB ( 4%) ggc
 tree PTA                :  46.70 ( 2%) usr   0.06 ( 0%) sys  46.90 ( 2%) wall  100249 kB ( 3%) ggc
 tree SSA rewrite        :  19.16 ( 1%) usr   0.15 ( 0%) sys  19.09 ( 1%) wall  149347 kB ( 4%) ggc
 tree SSA incremental    :  27.75 ( 1%) usr   0.61 ( 1%) sys  27.86 ( 1%) wall   72307 kB ( 2%) ggc
 tree operand scan       :  57.17 ( 2%) usr   3.03 ( 7%) sys  59.92 ( 2%) wall 1296208 kB (36%) ggc
 dominator optimization  :  35.95 ( 1%) usr   0.21 ( 0%) sys  35.74 ( 1%) wall  311024 kB ( 9%) ggc
 tree CCP                :  31.61 ( 1%) usr   0.12 ( 0%) sys  31.17 ( 1%) wall  111169 kB ( 3%) ggc
 tree PRE                :  87.46 ( 3%) usr   0.60 ( 1%) sys  88.62 ( 3%) wall  538859 kB (15%) ggc
 tree FRE                :  47.37 ( 2%) usr   0.58 ( 1%) sys  45.89 ( 2%) wall  274455 kB ( 8%) ggc
 tree aggressive DCE     :   8.96 ( 0%) usr   0.22 ( 1%) sys   8.86 ( 0%) wall  137686 kB ( 4%) ggc
 tree forward propagate  :  10.28 ( 0%) usr   0.10 ( 0%) sys  10.33 ( 0%) wall   56466 kB ( 2%) ggc
 tree slp vectorization  :  25.42 ( 1%) usr   0.16 ( 0%) sys  25.50 ( 1%) wall  436119 kB (12%) ggc
 complete unrolling      :   5.81 ( 0%) usr   0.13 ( 0%) sys   6.07 ( 0%) wall  115165 kB ( 3%) ggc
 tree vectorization      :   1.44 ( 0%) usr   0.05 ( 0%) sys   1.36 ( 0%) wall   31337 kB ( 1%) ggc
 tree iv optimization    :  13.00 ( 0%) usr   0.08 ( 0%) sys  12.94 ( 0%) wall  185893 kB ( 5%) ggc
 dominance computation   :  48.61 ( 2%) usr   0.54 ( 1%) sys  47.65 ( 2%) wall       0 kB ( 0%) ggc
 expand vars             :  18.81 ( 1%) usr   0.09 ( 0%) sys  18.42 ( 1%) wall  167798 kB ( 5%) ggc
 expand                  : 116.32 ( 4%) usr   0.61 ( 1%) sys 116.22 ( 4%) wall 1508612 kB (42%) ggc
 forward prop            :  23.01 ( 1%) usr   0.36 ( 1%) sys  23.43 ( 1%) wall  130825 kB ( 4%) ggc
 CSE                     :  67.21 ( 2%) usr   0.23 ( 1%) sys  66.28 ( 2%) wall   44439 kB ( 1%) ggc
 dead store elim1        :  20.47 ( 1%) usr   0.10 ( 0%) sys  20.83 ( 1%) wall  103309 kB ( 3%) ggc
 dead store elim2        :  18.99 ( 1%) usr   0.18 ( 0%) sys  20.48 ( 1%) wall  140398 kB ( 4%) ggc
 CPROP                   :  52.83 ( 2%) usr   0.33 ( 1%) sys  52.91 ( 2%) wall  336514 kB ( 9%) ggc
 PRE                     :  30.60 ( 1%) usr   0.06 ( 0%) sys  30.51 ( 1%) wall   52724 kB ( 1%) ggc
 CSE 2                   :  37.89 ( 1%) usr   0.04 ( 0%) sys  38.88 ( 1%) wall   29785 kB ( 1%) ggc
 combiner                :  80.20 ( 3%) usr   0.23 ( 1%) sys  80.57 ( 3%) wall  400168 kB (11%) ggc
 integrated RA           : 191.13 ( 7%) usr   0.44 ( 1%) sys 190.64 ( 7%) wall 2328880 kB (65%) ggc
 reload                  :  65.46 ( 2%) usr   0.09 ( 0%) sys  67.43 ( 2%) wall  193522 kB ( 5%) ggc
 reload CSE regs         :  56.71 ( 2%) usr   0.14 ( 0%) sys  56.49 ( 2%) wall  241394 kB ( 7%) ggc
 thread pro- & epilogue  :  14.43 ( 1%) usr   0.15 ( 0%) sys  14.97 ( 1%) wall  201098 kB ( 6%) ggc
 final                   :  44.77 ( 2%) usr   2.80 ( 6%) sys  48.99 ( 2%) wall  367580 kB (10%) ggc
 rest of compilation     :  57.58 ( 2%) usr   6.23 (14%) sys  63.50 ( 2%) wall  337908 kB ( 9%) ggc
 remove unused locals    :  41.68 ( 2%) usr   0.15 ( 0%) sys  42.04 ( 1%) wall     333 kB ( 0%) ggc
 TOTAL                 :2768.94            43.11          2814.85            3588723 kB
Comment 131 Steven Bosscher 2012-05-12 15:52:54 UTC
(In reply to comment #130)
> There are some obvious questions, like why IRA needs 63% of GGC memory,
> and VRP  23%

>  tree VRP                :  65.88 ( 2%) usr   0.73 ( 2%) sys  66.71 
>( 2%) wall  862879 kB (24%) ggc

Is it possible to do this again with gathering statistics enabled? The
only thing I can think of for this would be ASSERT_EXPRs and all the
rewriting involved for them.


>  tree slp vectorization  :  25.42 ( 1%) usr   0.16 ( 0%) sys  25.50
> ( 1%) wall  436119 kB (12%) ggc

This 12% also seems excessive.


>  CPROP                   :  52.83 ( 2%) usr   0.33 ( 1%) sys  52.91
> ( 2%) wall  336514 kB ( 9%) ggc

And this one also.  I'll see if I can understand and explain this one.


>  integrated RA           : 191.13 ( 7%) usr   0.44 ( 1%) sys 190.64
> ( 7%) wall 2328880 kB (65%) ggc

Uh, wow! :-(
Comment 132 Jan Hubicka 2012-05-12 18:32:14 UTC
> >  tree VRP                :  65.88 ( 2%) usr   0.73 ( 2%) sys  66.71 
> >( 2%) wall  862879 kB (24%) ggc
> 
> Is it possible to do this again with gathering statistics enabled? The

I started it some time ago, but it takes a while (it runs out of RAM even
on my machine ;)

> only thing I can think of for this would be ASSERT_EXPRs and all the
> rewriting involved for them.

It also might be folding doing too much of temporary stuff.

> >  tree slp vectorization  :  25.42 ( 1%) usr   0.16 ( 0%) sys  25.50
> > ( 1%) wall  436119 kB (12%) ggc
> 
> This 12% also seems excessive.

Indeed it is.
> >  integrated RA           : 191.13 ( 7%) usr   0.44 ( 1%) sys 190.64
> > ( 7%) wall 2328880 kB (65%) ggc
> 
> Uh, wow! :-(

Yep, seems something degenerate here.  IRA is usually not that big of a memory hog.

Honza
Comment 133 Jan Hubicka 2012-05-12 19:07:32 UTC
Another thing to observe is that GGC memory is "just" 4GB.  I am not sure where the other 8GB goes when our IL is believed
to be the major memory consumer and it resides almost completely in GGC memory.

Perhaps some of the streaming hashtables get out of control.

Also it seems that line number info is about 1GB. It may be a win to write better streaming of locations.
The current one enables almost no reuse of locators.
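
One way to picture the kind of locator reuse meant here is a one-entry cache: if the next streamed location has the same file, line and column as the previous one, hand back the previous locator instead of creating a new one. This is only a sketch under that assumption; struct loc_cache and cached_locator are hypothetical names, and a counter stands in for the real line-map allocation:

```c
#include <string.h>

/* Hypothetical one-entry cache of the last location seen.  */
struct loc_cache {
  const char *file;        /* must outlive the cache; not copied */
  int line;
  int column;
  unsigned int locator;    /* locator returned for the entry above */
  unsigned int allocated;  /* how many fresh locators were created */
};

/* Return a locator for FILE:LINE:COLUMN, reusing the previous one when
   the location repeats.  A fresh locator is simulated by a counter.  */
static unsigned int
cached_locator (struct loc_cache *c, const char *file, int line, int column)
{
  if (c->file != NULL
      && strcmp (c->file, file) == 0
      && c->line == line
      && c->column == column)
    return c->locator;

  c->file = file;
  c->line = line;
  c->column = column;
  c->locator = ++c->allocated;
  return c->locator;
}
```

A cache like this only helps when consecutive statements share a location; it says nothing about how much of the 1GB it would actually save.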

Honza
Comment 134 Jan Hubicka 2012-05-12 20:22:27 UTC
I tracked down the LTO/WHOPR code size difference. It is EH handling. The EH frame is empty for the LTO build and quite large for WHOPR.  Probably -fno-exceptions is getting lost on the way to ltrans?

With memory stats there don't seem to be major surprises:
tree-phinodes.c:129 (allocate_phi_node)           110246192: 0.8%          0: 0.0%    3405296: 0.1%     409376: 0.0%     372408
gimple.c:600 (gimple_build_nop)                   119935632: 0.8%          0: 0.0%     252144: 0.0%          0: 0.0%    2503912
gimplify.c:437 (create_tmp_var_raw)               119589760: 0.8%          0: 0.0%    1119200: 0.0%          0: 0.0%     754431
tree-vrp.c:3993 (build_assert_expr_for)           124663296: 0.9%          0: 0.0%          0: 0.0%          0: 0.0%    1298576
emit-rtl.c:3731 (make_jump_insn_raw)              118395600: 0.8%          0: 0.0%   11138960: 0.3%          0: 0.0%    1619182
tree-streamer-in.c:484 (streamer_alloc_tree)       90340024: 0.6%          0: 0.0%   51300472: 1.5%       4376: 0.0%    1420249
simplify-rtx.c:183 (simplify_gen_binary)          153607224: 1.1%          0: 0.0%     619968: 0.0%          0: 0.0%    6426133
fold-const.c:1870 (fold_convert_loc)              154700600: 1.1%          0: 0.0%       2160: 0.0%          0: 0.0%    3867569
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw   80243272: 0.6% 1267966456:15.3%   76357960: 2.2%   11155352: 1.2%    1833025
lto/lto.c:281 (lto_read_in_decl_state)               835696: 0.0%          0: 0.0%  163487336: 4.6%   31116920: 3.4%    4176305
cfg.c:216 (connect_src)                           174302184: 1.2%     623048: 0.0%    7861944: 0.2%     133632: 0.0%    4542618
cfg.c:226 (connect_dest)                          177198328: 1.2%    5444688: 0.1%    8603432: 0.2%     347648: 0.0%    4628047
tree.c:9115 (make_vector_type)                    206615472: 1.4%          0: 0.0%       6720: 0.0%          0: 0.0%    1229894
emit-rtl.c:639 (gen_rtx_MEM)                      202133352: 1.4%          0: 0.0%    6629016: 0.2%          0: 0.0%    8698432
dwarf2cfi.c:386 (copy_cfi_row)                    212886640: 1.5%          0: 0.0%          0: 0.0%          0: 0.0%    1400570
tree-inline.c:4851 (copy_decl_no_change)          211988960: 1.5%          0: 0.0%    7283480: 0.2%          0: 0.0%    1387268
tree-ssanames.c:78 (init_ssanames)                224107008: 1.6%  252869632: 3.1%       1536: 0.0%  153516032:16.6%     309555
lists.c:144 (alloc_EXPR_LIST)                     236354400: 1.7%          0: 0.0%    5798160: 0.2%          0: 0.0%   10089690
gimple.c:2237 (gimple_copy)                       268995784: 1.9%          0: 0.0%    4002872: 0.1%     644208: 0.1%    2530798
gimple-streamer-in.c:95 (input_gimple_stmt)       272340080: 1.9%          0: 0.0%    4356168: 0.1%     917040: 0.1%    2550173
tree-inline.c:4331 (copy_tree_r)                  286698704: 2.0%          0: 0.0%    2053920: 0.1%          0: 0.0%    5999420
rtl.c:287 (copy_rtx)                              291942896: 2.0%          0: 0.0%     318864: 0.0%          0: 0.0%   12315136
emit-rtl.c:393 (gen_raw_REG)                      271761568: 1.9%          0: 0.0%   25188032: 0.7%          0: 0.0%    9279675
cselib.c:1896 (cselib_subst_to_values)            299291264: 2.1%          0: 0.0%          0: 0.0%          0: 0.0%   12658684
emit-rtl.c:5427 (init_emit)                       354914672: 2.5%   19547728: 0.2%          0: 0.0%  102897600:11.1%     132600
cgraph.c:359 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  401297520:11.4%          0: 0.0%    1286210
emit-rtl.c:3679 (make_insn_raw)                   435416472: 3.0%          0: 0.0%    1754496: 0.0%          0: 0.0%    6071819
fold-const.c:7624 (build_fold_addr_expr_with_typ  463283920: 3.2%          0: 0.0%      72880: 0.0%          0: 0.0%   11583920
tree-ssanames.c:141 (make_ssa_name_fn)            459164960: 3.2%          0: 0.0%    5805920: 0.2%          0: 0.0%    5812136
cfg.c:142 (alloc_block)                           469702464: 3.3%          0: 0.0%   20328672: 0.6%          0: 0.0%    4375278
toplev.c:964 (realloc_for_line_map)                       0: 0.0%  357908640: 4.3% 1073741848:30.4%        184: 0.0%          9
tree.c:1228 (build_int_cst_wide)                 1188738504: 8.3%          0: 0.0%   31478720: 0.9%  401175208:43.3%     295230
tree-streamer-in.c:495 (streamer_alloc_tree)     2413661896:16.9%          0: 0.0% 1163973288:32.9%   41183648: 4.4%   28110064
Total                                            14300758513       8262871404       3534486067        927547008        308001940
source location                                     Garbage            Freed             Leak         Overhead            Times

From explicitly freed GGC mem there are a few interesting cases:
alias.c:2807 (init_alias_analysis)                        0: 0.0%  597580152: 7.2%          0: 0.0%  116629208:12.6%    1033104
reload1.c:663 (grow_reg_equivs)                           0: 0.0% 2244546880:27.2%          0: 0.0%    1859904: 0.2%     204226
tree-ssa-operands.c:331 (ssa_operand_alloc)               0: 0.0% 1326537728:16.1%       1024: 0.0%          0: 0.0%     299739
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw   80243272: 0.6% 1267966456:15.3%   76357960: 2.2%   11155352: 1.2%    1833025


Heap vectors:

source location                                        Leak             Peak            Times
-------------------------------------------------------

ipa-reference.c:171 (set_reference_vars_info)             0: 0.0%   11240664              13: 0.0%
ipa-pure-const.c:236 (set_function_state)                 0: 0.0%   13472632          842964: 0.8%
ipa-inline-analysis.c:3010 (read_inline_edge_sum          0: 0.0%   17281356          870489: 0.8%
ipa-prop.c:136 (ipa_initialize_node_params)               0: 0.0%   29039016          666148: 0.6%
ipa-inline-analysis.c:804 (inline_summary_alloc)          0: 0.0%   30037064               1: 0.0%
ipa-prop.h:308 (ipa_check_create_node_params)             0: 0.0%   51448408               1: 0.0%
ipa-prop.h:313 (ipa_check_create_node_params)             0: 0.0%   51448448               1: 0.0%
....
tree-vect-slp.c:1553 (vect_analyze_slp_instance)      49136: 0.1%      80056            3273: 0.0%
tree-vect-slp.c:1521 (vect_analyze_slp_instance)      49256: 0.1%      80136            3273: 0.0%
tree-into-ssa.c:1049 (mark_phi_for_rewrite)           60776: 0.1%      71352              11: 0.0%
cfgloop.c:1151 (get_loop_exit_edges)                 310312: 0.6%     316976          310269: 0.3%
tree-into-ssa.c:291 (get_ssa_name_ann)               352928: 0.6%     612512              13: 0.0%
passes.c:2214 (execute_one_pass)                     934496: 1.7%   41942992          557113: 0.5%
tree-ssa-structalias.c:3861 (handle_lhs_call)       1491552: 2.6%    2359224           20716: 0.0%
ipa-inline-analysis.c:2645 (inline_merge_summary    2432148: 4.3%    2442960          157716: 0.1%
tree-ssa-loop-im.c:1556 (record_mem_ref_loc)        6634880:11.8%   10465232          595488: 0.6%
tree-ssa-loop-im.c:1545 (record_mem_ref_loc)        7587408:13.5%   12637232          579373: 0.5%
ipa-reference.c:186 (set_reference_optimization_   10289688:18.3%   11240664              13: 0.0%
lto-cgraph.c:118 (lto_cgraph_encoder_encode)       12756976:22.7%   23348152           25665: 0.0%
ipa-ref.c:55 (ipa_record_reference)                13164872:23.4%   41932432         1000598: 0.9%
Total                                              56309568                         107517917

I will try to look for ipa-ref related leaks... These should not outgrow other IPA structures, but they are not _that_ off.  

Bitmap                                     Overall       Allocated            Peak            Leak   searched   search itr
---------------------------------------------------------------------------------
df-problems.c:550 (df_rd_transfer_functio  1401668       550959000       285854280       285854280    1202920    2686239
df-problems.c:4368 (df_md_alloc)           2420865       119625200       103991640       103991640    7882560     876516
df-problems.c:4370 (df_md_alloc)           2420865        47313120        44242920        44242920          0          0
df-problems.c:4366 (df_md_alloc)           2420865        11779160        11744960        11744960          0          0
df-problems.c:4367 (df_md_alloc)           2420865        26404920        26403880        26403880     271729          4
tree-ssa-structalias.c:1249 (build_pred_g  2603931       225511920       225511920       225511920     187843     110177
tree-ssa-tail-merge.c:1316 (deps_ok_for_r   593970        30665680        16874760        16874760        632         40
tree-ssa-structalias.c:5890 (find_what_va  2328862       113793160       102564760       102564760     710275     853412
df-problems.c:1389 (df_live_alloc)         1806260        76241920        12459320        12459320       1826          0
df-problems.c:1390 (df_live_alloc)         1806260       281713360        38869560        38868680    2579692    1190624
df-problems.c:1392 (df_live_alloc)         1806260       991814240        40633200        40629040     221318     201166
dse.c:2452 (copy_fixed_regs)               1132737        90618960        90618960        90618960          0          0
df-problems.c:1391 (df_live_alloc)         1806260      1491519600        40632440        40628480     536753     522104
tree-ssa-loop-im.c:1512 (mem_ref_alloc)     567787        33164080        12373120        12372440          0          0
reload1.c:495 (new_insn_chain)             5276019       402655640       401709040       401709040      24691          0
tree-ssa-pre.c:619 (bitmap_set_new)       32638618       990092880       562280520       562280440   20419995   15879008
tree-ssa-pre.c:620 (bitmap_set_new)       32638618       990371960       574119360       574119280   16846876   10621314
df-problems.c:261 (df_rd_alloc)            2741972       138884160       129954960       129954960    2949744     610463
reload1.c:496 (new_insn_chain)             5276019       151328120       151029880       151029880     388762      10455
tree-ssa-structalias.c:2559 (solve_graph)  3169222       256948000       256292160       256292160          0          0
tree-ssanames.c:90 (init_ssanames)          309555        25951800        12382440        12382200   18777080    7410198
tree-ssa-structalias.c:2113 (label_visit)  5147637       425173040       425173040       425173040     105478      61601
tree-ssa-structalias.c:1108 (add_implicit  4593393       382459560       382459560       382459560     726652     628375
tree-ssa-structalias.c:1123 (add_pred_gra  3379786       273371640       273371640       273371640     121581      98415
tree-ssa-structalias.c:1144 (add_graph_ed  2917231       246071240       174844960       174844960     681820     290190
df-problems.c:262 (df_rd_alloc)            2741972       530288680       506786360       506786360          0          0
df-problems.c:263 (df_rd_alloc)            2741972       304266640       233174000       233172280        108        108
tree-ssa-structalias.c:361 (new_var_info)  7385339       467574280       360290520       360290520      44320      85263

Alloc-pool Kind         Elt size  Pools  Allocated (elts)            Peak (elts)            Leak (elts)
--------------------------------------------------------------------------------------------------------------
insn_info_pool             56     204084  538278104(   9612109)     830704(     14834)          0(         0)
bb_info_pool               56     204084  133331912(   2380927)     133616(      2386)          0(         0)
rtx_group_info_pool       112     204084   56406672(    503631)     138768(      1239)          0(         0)
Bitmap sets                80     204085 2611089440(  32638618)    8824880(    110311)          0(         0)
deferred_change_pool       24     204084      52128(      2172)        288(        12)          0(         0)
pre_expr nodes             16     204085  138421792(   8651362)     981200(     61325)          0(         0)
cse_store_info_pool       104    1972759   98188584(    944121)     485472(      4668)          0(         0)
value                      16     843341  462086672(  28880417)     245280(     15330)          0(         0)
VN phis                    32     408170   88913824(   2778557)      83712(      2616)          0(         0)
Constraint pool            32     204085  353203136(  11037598)     594528(     18579)          0(         0)
struct case_node pool      48       4743    1096848(     22851)      13680(       285)          0(         0)
Variable info pool         72     204085  531744408(   7385339)     601560(      8355)          0(         0)
IPA-CP value sources       32          1    4760736(    148773)    4260384(    133137)          0(         0)
et_occ pool                48    2116800 3595771776(  74911912)     688128(     14336)          0(         0)
VN references              56     408170  323302616(   5773261)    3466680(     61905)          0(         0)
et_node pool               64    2116800 2533145216(  39580394)     458880(      7170)          0(         0)
dep_node                   80     102042  734534240(   9181678)    4233840(     52923)          0(         0)
df_chain_block pool        16     251647  436908640(  27306790)    2391808(    149488)          0(         0)
IPA-CP values              80          1    5005280(     62566)    5005280(     62566)          0(         0)
df_scan ref base           56     204084 6325340840( 112952515)    2948400(     52650)          0(         0)
SRA accesses              120     102043   13514520(    112621)      92760(       773)          0(         0)
df_scan ref artificial     64     204084  901356672(  14083698)     899200(     14050)          0(         0)
df_scan ref regular        64     204084 2184845888(  34138217)    2431168(     37987)          0(         0)
allocnos                  160     102042  281957120(   1762232)    1250560(      7816)          0(         0)
elt_list                   16     843341  619139328(  38696208)     240832(     15052)          0(         0)
elt_loc_list               24     843341 1153775424(  48073976)     521760(     21740)          0(         0)
df_scan insn               48     204084  926799792(  19308329)    1070400(     22300)          0(         0)
live ranges                40     102042  106931600(   2673290)     508880(     12722)          0(         0)
df_scan reg                16     204084  934613472(  58413342)     783216(     48951)          0(         0)
SRA links                  24     102043     402672(     16778)       4848(       202)          0(         0)
rtx_store_info_pool       104     204084   19621264(    188666)     213096(      2049)          0(         0)
strinfo_struct pool        56     102042     324184(      5789)       1344(        24)          0(         0)
edge predicates            40          1    3540840(     88521)    2030280(     50757)          0(         0)
original_copy               8     509567    3890016(    486252)      13264(      1658)          0(         0)
cost vectors              192    2551050  192202512(   1001054)     419392(      2184)          0(         0)
operand entry pool         24     204084   18481680(    770070)      89424(      3726)          0(         0)
objects                    72     102042  126880704(   1762232)     562752(      7816)          0(         0)
deps_list                  16     102042  385122400(  24070150)     847120(     52945)          0(         0)
cselib_val_list            40     843341 1155216680(  28880417)     613200(     15330)          0(         0)
copies                     80     102042   27013920(    337674)     324480(      4056)          0(         0)
read_info_pool             32     204084   84871968(   2652249)      91104(      2847)          0(         0)

GIMPLE statements
Kind                   Stmts      Bytes
---------------------------------------
assignments          6803719  658739112
phi nodes             372408  112832736
conditionals         1121446  107658816
everything else      3704547  292211544

Kind                   Nodes      Bytes
---------------------------------------
decls                15883790 -1764091088
types                6197660 1041206880
blocks               1809846  144787680
stmts                  52888    3384832
refs                 11131010  561131416
exprs                31414309 1351944944
constants            2761315   97231060
identifiers          1227582   49103280
vecs                  295323  417871880
binfos               1420249  141631744
ssa names            5812136  464970880
constructors          340124    8162976
random kinds         3280618  131225128
lang_decl kinds            0          0
lang_type kinds            0          0
omp clauses                0          0
---------------------------------------
Total                81626850 -1646405684
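
The negative byte counts for decls and Total above look like signed 32-bit wraparound in the statistics counters rather than real values. Assuming that, and assuming the decls counter wrapped exactly once, the numbers can be cross-checked with a small C sketch. The function names are hypothetical; the values are copied from the table (rows with 0 bytes omitted):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes per kind exactly as printed in the table above.  */
static const int64_t printed_bytes[] = {
  -1764091088LL, /* decls (printed negative) */
  1041206880LL,  /* types */
  144787680LL,   /* blocks */
  3384832LL,     /* stmts */
  561131416LL,   /* refs */
  1351944944LL,  /* exprs */
  97231060LL,    /* constants */
  49103280LL,    /* identifiers */
  417871880LL,   /* vecs */
  141631744LL,   /* binfos */
  464970880LL,   /* ssa names */
  8162976LL,     /* constructors */
  131225128LL    /* random kinds */
};
#define N_KINDS (sizeof printed_bytes / sizeof printed_bytes[0])

/* Undo one 32-bit wraparound by reinterpreting the value modulo 2^32.  */
static uint64_t
unwrap_once (int64_t printed)
{
  return (uint64_t) (uint32_t) printed;
}

/* Sum the rows the way a signed 32-bit counter would see them.  */
static int64_t
wrapped_total (void)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < N_KINDS; i++)
    sum += (uint32_t) printed_bytes[i];
  return sum >= 0x80000000u ? (int64_t) sum - 4294967296LL : (int64_t) sum;
}

/* Sum the rows in 64 bits, unwrapping negative rows once (assumption).  */
static uint64_t
unwrapped_total (void)
{
  uint64_t sum = 0;
  for (size_t i = 0; i < N_KINDS; i++)
    sum += printed_bytes[i] < 0 ? unwrap_once (printed_bytes[i])
                                : (uint64_t) printed_bytes[i];
  return sum;
}
```

Under these assumptions decls would be 2530876208 bytes (about 159 bytes per decl, in line with the WPA-stage table), the 32-bit sum reproduces the printed Total of -1646405684, and the 64-bit total comes to 6943528908 bytes.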
Comment 135 Jan Hubicka 2012-05-12 21:33:36 UTC
... and mem reports on WPA stage:

toplev.c:964 (realloc_for_line_map)                       0: 0.0%   89473168: 9.4%  268435472:10.3%        160: 0.0%          8
cgraph.c:359 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  401297520:15.3%          0: 0.0%    1286210
tree.c:1228 (build_int_cst_wide)                 1188709752:33.7%          0: 0.0%   22765400: 0.9%  399425424:83.1%     208540
tree-streamer-in.c:495 (streamer_alloc_tree)     1950272016:55.3%          0: 0.0% 1143907104:43.7%   41182080: 8.6%   22462122
Total                                            3527995024        956449616       2618397893        480920037         47749265
source location                                     Garbage            Freed             Leak         Overhead            Times


So about 50% trees, 15% cgraph nodes (I do have plans for how to get those smaller), 10% linemaps (I wonder if a simple cache would not save a lot of locators), 5% inline summaries.

I wonder who is producing that 1GB of temporary integer nodes? Someone abusing them for counting too much? It is there before IPA, so it seems to be streaming or type machinery.

Heap vectors:

source location                                        Leak             Peak            Times
-------------------------------------------------------

ipa-reference.c:186 (set_reference_optimization_   10289688:10.5%   11240664              13: 0.0%
lto-cgraph.c:118 (lto_cgraph_encoder_encode)       12756976:13.0%   23348152           26300: 0.2%
ipa-ref.c:55 (ipa_record_reference)                13593072:13.8%   41932432         1000565: 6.0%
passes.c:2214 (execute_one_pass)                   21214520:21.5%   41942992          557113: 3.3%
ipa-inline-analysis.c:804 (inline_summary_alloc)   30037064:30.5%   30037064               1: 0.0%
Total                                              98450004                          16768143

Bitmap                                     Overall       Allocated            Peak            Leak   searched   search itr
---------------------------------------------------------------------------------
ipa-reference.c:911 (propagate)             372741        31244280        31223720        31223720          0          0
ipa-reference.c:739 (propagate)             329258        13341680         3058960         3058960          0          0
ipa-reference.c:923 (propagate)             372186        25153920        25138520        25138520          0          0
ipa-reference.c:417 (init_function_info)    487263        19809560        19809560        19809560        551        335
ipa-reference.c:418 (init_function_info)    487263        19584680        19584680        19584680         79         45
ipa-reference.c:747 (propagate)             329351        13229360         3053920         3053920          0          0

Kind                   Nodes      Bytes
---------------------------------------
decls                11059354 1770384416
types                6163492 1035466656
blocks                     1         80
stmts                      0          0
refs                    5243     267944
exprs                1826905   74999944
constants            2198755   72290570
identifiers           538891   21555640
vecs                  208540  412624304
binfos               1420249  141631744
ssa names                111       8880
constructors          159169    3820056
random kinds         3270917  130837088

Honza
Comment 136 Jan Hubicka 2012-05-13 16:29:04 UTC
... and oprofile of compilation stage of -flto-partition=none
samples  %        image name               app name                 symbol name
194976    2.8536  lto1                     lto1                     alloc_page
109091    1.5966  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
99458     1.4556  lto1                     lto1                     operand_equal_p
88092     1.2893  lto1                     lto1                     record_reg_classes
87508     1.2807  lto1                     lto1                     bitmap_set_bit
75628     1.1069  lto1                     lto1                     estimate_edge_growth
68760     1.0064  lto1                     lto1                     mem_attrs_eq_p
62151     0.9096  lto1                     lto1                     for_each_rtx_1
58274     0.8529  libc-2.11.1.so           libc-2.11.1.so           memset
55257     0.8087  libc-2.11.1.so           libc-2.11.1.so           malloc
52116     0.7628  lto1                     lto1                     htab_find_slot_with_hash
50481     0.7388  oprofiled                oprofiled                /usr/bin/oprofiled
42524     0.6224  lto1                     lto1                     ggc_set_mark
40190     0.5882  lto1                     lto1                     constrain_operands
40124     0.5872  lto1                     lto1                     lookup_page_table_entry
39279     0.5749  lto1                     lto1                     extract_insn
34436     0.5040  lto1                     lto1                     ggc_internal_alloc_stat
33609     0.4919  lto1                     lto1                     preprocess_constraints
32843     0.4807  lto1                     lto1                     get_attr_enabled
32582     0.4769  lto1                     lto1                     reload_cse_simplify_operands
32573     0.4767  lto1                     lto1                     bitmap_clear_bit
32278     0.4724  libc-2.11.1.so           libc-2.11.1.so           malloc_consolidate
29633     0.4337  lto1                     lto1                     bitmap_bit_p
29593     0.4331  lto1                     lto1                     find_reg_note
29428     0.4307  libc-2.11.1.so           libc-2.11.1.so           _int_free
29161     0.4268  lto1                     lto1                     df_note_bb_compute
28939     0.4235  libc-2.11.1.so           libc-2.11.1.so           calloc
28794     0.4214  lto1                     lto1                     cse_insn
28084     0.4110  lto1                     lto1                     find_reloads
26192     0.3833  lto1                     lto1                     ix86_decompose_address
25211     0.3690  libc-2.11.1.so           libc-2.11.1.so           memcpy
25016     0.3661  lto1                     lto1                     df_ref_create_structure
24321     0.3560  lto1                     lto1                     nonzero_bits1
24066     0.3522  lto1                     lto1                     htab_traverse_noresize
23895     0.3497  libc-2.11.1.so           libc-2.11.1.so           free
Comment 137 Jan Hubicka 2012-08-10 15:06:51 UTC
So since the last report we managed to double WPA memory usage and compile time...
12m wall, 42m user is needed for WPA build.
Execution times (seconds)
 phase opt and generate  :  97.34 (21%) usr   0.33 ( 1%) sys  97.70 (20%) wall   98900 kB ( 3%) ggc
 phase stream in         : 242.70 (51%) usr   5.12 (22%) sys 247.94 (50%) wall 3174311 kB (97%) ggc
 phase stream out        : 131.99 (28%) usr  17.49 (76%) sys 149.59 (30%) wall       0 kB ( 0%) ggc
 garbage collection      :  24.01 ( 5%) usr   0.00 ( 0%) sys  24.03 ( 5%)
 ipa lto gimple out      :  12.59 ( 3%) usr   1.07 ( 5%) sys  13.69 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 188.50 (40%) usr   3.93 (17%) sys 192.53 (39%) wall 2083552 kB (64%) ggc
 ipa lto decl out        : 113.33 (24%) usr   8.48 (37%) sys 121.84 (25%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   5.58 ( 1%) usr   0.67 ( 3%) sys   6.25 ( 1%) wall  684122 kB (21%) ggc
 ipa lto decl merge      :  10.64 ( 2%) usr   0.01 ( 0%) sys  10.64 ( 2%) wall     291 kB ( 0%) ggc
 ipa lto cgraph merge    :   9.15 ( 2%) usr   0.01 ( 0%) sys   9.17 ( 2%) wall   15100 kB ( 0%) ggc
 whopr wpa               :   5.80 ( 1%) usr   0.05 ( 0%) sys   5.89 ( 1%) wall       1 kB ( 0%) ggc
 whopr wpa I/O           :   2.19 ( 0%) usr   7.94 (35%) sys  10.19 ( 2%)
 inline heuristics       :  61.46 (13%) usr   0.31 ( 1%) sys  61.80 (12%) wall  351753 kB (11%) ggc
 callgraph verifier      :  15.97 ( 3%) usr   0.06 ( 0%) sys  16.00 ( 3%) wall       0 kB ( 0%) ggc
 TOTAL                 : 472.05            22.94           495.25            3274649 kB
Comment 138 Jan Hubicka 2012-08-10 15:35:44 UTC
Actually not, I looked up the wrong report. The last report in comment #121 shows:
TOTAL                 : 616.43            22.26           651.79            2165706 kB

So we actually got noticeably faster, but need more memory. 1GB of GGC space, but a lot more in top report.  I will look into mem report analysis once I am done with merging some other cleanups/speedups.
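
As a quick check of that comparison, here is a minimal sketch. The numbers are the TOTAL rows from comment #137 and from the report quoted above; the helper name `pct_change` is made up for illustration.

```python
# TOTAL rows of the two -ftime-report summaries (seconds, and GGC memory in kB).
old = {"usr": 616.43, "sys": 22.26, "wall": 651.79, "ggc_kb": 2165706}  # comment #121
new = {"usr": 472.05, "sys": 22.94, "wall": 495.25, "ggc_kb": 3274649}  # comment #137


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


changes = {key: pct_change(old[key], new[key]) for key in old}
extra_ggc_gb = (new["ggc_kb"] - old["ggc_kb"]) / (1024 * 1024)

for key, value in changes.items():
    print(f"{key:7s} {value:+6.1f}%")
print(f"extra GGC memory: {extra_ggc_gb:.2f} GB")
```

Under these inputs wall time drops by roughly a quarter while GGC memory grows by about half, which is where the "1GB of GGC space" comes from.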
Comment 139 Jan Hubicka 2012-08-18 09:36:55 UTC
oprofile of WPA:
649295   18.2243  lto1                     lto1                     lto_main()
341256    9.5783  lto1                     lto1                     htab_find_slot_with_hash
126567    3.5525  lto1                     lto1                     do_estimate_growth_1(cgraph_node*, void*)
97142     2.7266  lto1                     lto1                     htab_expand
89658     2.5165  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
82117     2.3048  lto1                     lto1                     pointer_map_insert(pointer_map_t*, void const*)
60238     1.6907  lto1                     lto1                     iterative_hash_hashval_t(unsigned int, unsigned int)
58145     1.6320  lto1                     lto1                     ggc_internal_alloc_stat(unsigned long, char const*, int, char const*)
53679     1.5067  lto1                     lto1                     linemap_lookup(line_maps*, unsigned int)
47271     1.3268  lto1                     lto1                     lto_output_tree(output_block*, tree_node*, bool, bool)
43043     1.2081  lto1                     lto1                     gt_ggc_mx_lang_tree_node(void*)
42675     1.1978  lto1                     lto1                     verify_cgraph_node(cgraph_node*)
40609     1.1398  lto1                     lto1                     streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int*, bool)
40245     1.1296  lto1                     lto1                     ggc_marked_p(void const*)
39474     1.1079  libc-2.11.1.so           libc-2.11.1.so           memset
38955     1.0934  libc-2.11.1.so           libc-2.11.1.so           malloc_consolidate
32085     0.9006  lto1                     lto1                     streamer_write_uhwi_stream(lto_output_stream*, unsigned long)
31965     0.8972  lto1                     lto1                     ggc_set_mark(void const*)
31406     0.8815  lto1                     lto1                     lto_input_tree(lto_input_block*, data_in*)
29213     0.8199  lto1                     lto1                     streamer_read_tree_bitfields(lto_input_block*, tree_node*)
26846     0.7535  lto1                     lto1                     hash_pointer
25870     0.7261  libc-2.11.1.so           libc-2.11.1.so           memcpy


We still spend an insanely long time walking types in lto_main (introduced by Michael's patch).
Comment 140 Jan Hubicka 2012-08-19 05:55:26 UTC
Author: hubicka
Date: Sun Aug 19 05:55:20 2012
New Revision: 190509

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190509
Log:

	PR lto/45375
	* ipa-inline.c (want_inline_small_function_p): Bypass
	inline limits for hinted functions.
	(edge_badness): Dump hints; decrease badness for hinted functions.
	* ipa-inline.h (enum inline_hints_vals): New enum.
	(inline_hints): New type.
	(edge_growth_cache_entry): Add hints.
	(dump_inline_summary): Update.
	(dump_inline_hints): Declare.
	(do_estimate_edge_hints): Declare.
	(estimate_edge_hints): New inline function.
	(reset_edge_growth_cache): Update.
	* predict.c (cgraph_maybe_hot_edge_p): Do not ice on indirect edges.
	* ipa-inline-analysis.c (dump_inline_hints): New function.
	(estimate_edge_devirt_benefit): Return true when function should be
	hinted.
	(estimate_calls_size_and_time): New hints argument; set it when
	devirtualization happens.
	(estimate_node_size_and_time): New hints argument.
	(do_estimate_edge_time): Cache hints.
	(do_estimate_edge_growth): Update.	
	(do_estimate_edge_hints): New function

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/predict.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/ipa/iinline-1.c
Comment 141 Markus Trippelsdorf 2012-09-15 14:05:38 UTC
After the new IonMonkey JIT went in (http://blog.mozilla.org/javascript/2012/09/12/ionmonkey-in-firefox-18/) 
peak memory use went up. It is now 6.8GB (gcc-4.7 roughly the same: 6.5GB).
So we're approaching the point where a 8GB machine isn't enough to
build Firefox with LTO...
Comment 142 Jan Hubicka 2012-10-08 22:19:55 UTC
After updating Mozilla this weekend, I definitely exceed the memory of an 8GB machine. The peak in top is around 9-10GB.  I checked malloc usage and there are not many surprises. It is about 300MB, mostly GGC overhead, pointer maps and such.

Most memory is actually the GGC, about 7GB. Here 5GB survives type and decl merging and is distributed as follows:
cgraph.c:722 (cgraph_allocate_init_indirect_info    1671240: 0.0%          0: 0.0%    8202960: 0.2%          0: 0.0%     246855
tree.c:1226 (build_int_cst_wide)                  625825208:12.3%          0: 0.0%   10437744: 0.2%    4863752: 3.1%     325009
ipa-prop.h:471 (ipa_check_create_edge_args)               0: 0.0%          0: 0.0%   16777216: 0.3%          0: 0.0%          1
ipa-inline-analysis.c:3697 (inline_read_section)          0: 0.0%   28298904: 1.6%   21095504: 0.4%    1064480: 0.7%     423701
tree.c:1561 (build_string)                         16526800: 0.3%          0: 0.0%   21695715: 0.4%    3395427: 2.2%     864326
ipa-prop.c:3393 (ipa_read_node_info)                      0: 0.0%    4302088: 0.2%   25029448: 0.5%     119192: 0.1%     246788
stringpool.c:75 (alloc_node)                              0: 0.0%          0: 0.0%   27817760: 0.5%          0: 0.0%     695444
ipa-ref.c:51 (ipa_record_reference)                       0: 0.0%  188442816:10.3%   28443272: 0.6%    2114424: 1.4%    1256259
stringpool.c:58 (stringpool_ggc_alloc)                    0: 0.0%          0: 0.0%   34673092: 0.7%    2619412: 1.7%     695444
lto/lto.c:2279 (create_subid_section_table)          275832: 0.0%          0: 0.0%   40363416: 0.8%    8051472: 5.2%       3978
tree-streamer-in.c:895 (lto_input_ts_constructor  171812232: 3.4%  192568640:10.6%   42205992: 0.8%    1425072: 0.9%     947082
ipa-prop.c:3380 (ipa_read_node_info)                      0: 0.0%   35825488: 2.0%   58764528: 1.1%     659704: 0.4%     909232
tree-streamer-in.c:488 (streamer_alloc_tree)      129846168: 2.6%          0: 0.0%   75997752: 1.5%       7072: 0.0%    2063753
tree.c:1263 (build_int_cst_wide)                  237791264: 4.7%          0: 0.0%   90464320: 1.8%          0: 0.0%   10257987
ipa-inline-analysis.c:3709 (inline_read_section)          0: 0.0%  133938484: 7.4%  101874268: 2.0%    1606480: 1.0%    1099389
lto-section-in.c:361 (lto_new_in_decl_state)           3240: 0.0%          0: 0.0%  107452560: 2.1%          0: 0.0%     895465
cgraph.c:653 (cgraph_create_edge_1)                       0: 0.0%          0: 0.0%  135509816: 2.6%          0: 0.0%    1302979
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw       2040: 0.0%  866397160:47.6%  190623368: 3.7%     263888: 0.2%      11459
lto/lto.c:267 (lto_read_in_decl_state)                 3024: 0.0%          0: 0.0%  225743280: 4.4%   41057176:26.5%    6268255
ipa-inline-analysis.c:931 (inline_summary_alloc)          0: 0.0%          0: 0.0%  268435464: 5.2%          8: 0.0%          1
cgraph.c:362 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  515473640:10.1%          0: 0.0%    1741465
toplev.c:953 (realloc_for_line_map)                       0: 0.0%  358955168:19.7% 1074790424:21.0%        184: 0.0%         19
tree-streamer-in.c:499 (streamer_alloc_tree)     3668091656:72.1%          0: 0.0% 1995384408:38.9%   87485792:56.5%   46580224
Total                                            5089831352       1821058652       5124870115        154815271         91384962
source location                                     Garbage            Freed             Leak         Overhead            Times

I.e. 20% are now linemaps, 38% trees read by the streamer, 10% cgraph nodes, 5% inline summaries, 4% streamer table converting UIDs to decls (that can be freed).
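
The shares quoted above can be re-derived from the Leak column of the report. This is a minimal sketch; I am assuming that the "4% streamer table" is the lto/lto.c:267 row, and the dictionary keys are just labels copied from the report.

```python
# Leak column (bytes) for the rows called out in the summary.
leak = {
    "toplev.c:953 realloc_for_line_map": 1074790424,
    "tree-streamer-in.c:499 streamer_alloc_tree": 1995384408,
    "cgraph.c:362 cgraph_allocate_node": 515473640,
    "ipa-inline-analysis.c:931 inline_summary_alloc": 268435464,
    "lto/lto.c:267 lto_read_in_decl_state": 225743280,  # assumed to be the UID-to-decl table
}
total_leak = 5124870115  # "Total" row, Leak column

shares = {site: 100.0 * size / total_leak for site, size in leak.items()}
for site, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{share:5.1f}%  {site}")
```

Together these five allocation sites account for about 80% of the surviving GGC memory.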

The trees are distributed as follows:
Kind                   Nodes      Bytes
---------------------------------------
decls                20489087 -1105370640
types                10321297 1733977896
blocks                102012    8160960
stmts                      0          0
refs                   44297    1806000
exprs                8205133  264995952
constants            11667038  376994197
identifiers           695444   27817760
vecs                  325009  626535448
binfos               2063753  205829776
ssa names                  0          0
constructors          369886    8877264
random kinds         7039351  281574472
lang_decl kinds            0          0
lang_type kinds            0          0
omp clauses                0          0
---------------------------------------
Total                61322307 -1863768211
---------------------------------------
I think all the blocks read to WPA are bugs.  We may also do better on sharing constants.

Code                   Nodes
----------------------------
identifier_node       695444
tree_list            7039346
tree_vec              325009
block                 102012
offset_type             1762
enumeral_type         371554
boolean_type            7097
integer_type          830019
real_type              10054
pointer_type         3089539
reference_type        215629
array_type            204968
record_type          3818337
union_type             77106
void_type               1478
function_type         259759
method_type          1433688
integer_cst          10784917
real_cst               17553
string_cst            864326
function_decl        2736272
label_decl             82077
field_decl           3121989
var_decl              323843
const_decl           2817588
parm_decl            5244428
type_decl            4906573
result_decl          1225435
constructor           369886
pointer_plus_expr     302600
nop_expr             3307128
addr_expr            4592681
tree_binfo           2063753

Honza
Comment 143 Steven Bosscher 2012-10-08 22:30:20 UTC
Created attachment 28395 [details]
Use size_t for tree code book-keeping

...because overflow looks so sloppy.
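
For reference, the negative byte counts in comment #142 are consistent with a 32-bit signed counter wrapping around. The sketch below assumes exactly that (a 32-bit int, and a decls counter that wrapped only once); the helper names are invented, and the true values could be larger by further multiples of 2^32.

```python
def to_unsigned32(value):
    """Reinterpret a 32-bit signed value as unsigned."""
    return value & 0xFFFFFFFF


def to_signed32(value):
    """Wrap an arbitrary integer into the signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


# "Bytes" column of the Kind table in comment #142.
reported_bytes = {
    "decls": -1105370640, "types": 1733977896, "blocks": 8160960,
    "stmts": 0, "refs": 1806000, "exprs": 264995952,
    "constants": 376994197, "identifiers": 27817760, "vecs": 626535448,
    "binfos": 205829776, "ssa names": 0, "constructors": 8877264,
    "random kinds": 281574472, "lang_decl kinds": 0,
    "lang_type kinds": 0, "omp clauses": 0,
}
reported_total = -1863768211

# Undo one wrap of the decls counter, then re-add everything without overflow.
recovered = dict(reported_bytes, decls=to_unsigned32(reported_bytes["decls"]))
recovered_total = sum(recovered.values())

print("decls bytes (at least):", recovered["decls"])
print("total bytes (at least):", recovered_total)
print("total re-wrapped to 32 bits:", to_signed32(recovered_total))
```

Re-wrapping the recovered total reproduces the printed Total line, which supports the overflow explanation.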
Comment 144 Markus Trippelsdorf 2012-12-01 12:39:30 UTC
It looks like there is a LTO code-size regression on trunk:
(size of libxul.so, build without elfhack):

gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
gcc         : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
clang lto   : size: 35071848 | Kraken bench: 2804.2ms +/- 1.2%
clang       : size: 36797384 | Kraken bench: 2819.6ms +/- 1.4%
Comment 145 Jan Hubicka 2012-12-01 22:09:07 UTC
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> 
> --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> It looks like there is a LTO code-size regression on trunk:
> (size of libxul.so, build without elfhack):
> 
> gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%

About LTO+PGO please be sure that you have the Teresa's fix from this Friday in your tree.

> gcc         : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%

Is LTO w/o PGO bigger than previous builds?

> clang lto   : size: 35071848 | Kraken bench: 2804.2ms +/- 1.2%
> clang       : size: 36797384 | Kraken bench: 2819.6ms +/- 1.4%
Comment 146 Markus Trippelsdorf 2012-12-02 07:36:02 UTC
(In reply to comment #145)
> > 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> > 
> > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > It looks like there is a LTO code-size regression on trunk:
> > (size of libxul.so, build without elfhack):
> > 
> > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
> 
> About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> your tree.

Yes, my tree already included this fix and also the fix from bug 55551.

> > gcc         : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
> 
> Is LTO w/o PGO bigger than previous builds?

Couldn't tell, because it doesn't link:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: hidden symbol 'pixman_add_triangles' in /var/tmp/moz-build-dir/toolkit/library/../../gfx/cairo/libpixman/src/pixman-trap.o is referenced by DSO /usr/lib64/libcairo.so
/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at runtime; recompile with -fPIC
/tmp/cc0oq4BG.ltrans0.ltrans.o:cc0oq4BG.ltrans0.o:function SharedStub: error: undefined reference to 'PrepareAndDispatch'
/tmp/cc0oq4BG.ltrans1.ltrans.o:cc0oq4BG.ltrans1.o:function SkAnnotation::CreateProc(SkFlattenableReadBuffer&) [clone .local.7828.1055099]: error: undefined reference to 'SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&)'
collect2: error: ld returned 1 exit status

The undefined reference to PrepareAndDispatch is easily fixed by
an __attribute__ ((used)).
Do you have an idea on how to fix the SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&) issue?
Comment 147 Jan Hubicka 2012-12-02 09:23:09 UTC
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> 
> --- Comment #146 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-02 07:36:02 UTC ---
> (In reply to comment #145)
> > > 
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> > > 
> > > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > > It looks like there is a LTO code-size regression on trunk:
> > > (size of libxul.so, build without elfhack):
> > > 
> > > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
> > 
> > About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> > your tree.
> 
> Yes, my tree already included this fix and also the fix from bug 55551.

Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some regressions
on some SPEC benchmarks (such as GCC) and this helps. If it doesn't it would be nice
to know what value is needed for comparable size.
> 
> > > gcc         : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
> > 
> > Is LTO w/o PGO bigger than previous builds?
> 
> Couldn't tell, because it doesn't link:
> 
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> warning: hidden symbol 'pixman_add_triangles' in
> /var/tmp/moz-build-dir/toolkit/library/../../gfx/cairo/libpixman/src/pixman-trap.o
> is referenced by DSO /usr/lib64/libcairo.so
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc
> against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at
> runtime; recompile with -fPIC
> /tmp/cc0oq4BG.ltrans0.ltrans.o:cc0oq4BG.ltrans0.o:function SharedStub: error:
> undefined reference to 'PrepareAndDispatch'
> /tmp/cc0oq4BG.ltrans1.ltrans.o:cc0oq4BG.ltrans1.o:function
> SkAnnotation::CreateProc(SkFlattenableReadBuffer&) [clone .local.7828.1055099]:
> error: undefined reference to
> 'SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&)'
> collect2: error: ld returned 1 exit status
> 
> The undefined reference to PrepareAndDispatch is easily fixed by
> an __attribute__ ((used)).
> Do you have an idea on how to fix the
> SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&) issue?

Hmm, I remember seeing this one, too.  I will check.

Honza
Comment 148 Markus Trippelsdorf 2012-12-02 11:57:27 UTC
(In reply to comment #147)
> > 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> > 
> > --- Comment #146 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-02 07:36:02 UTC ---
> > (In reply to comment #145)
> > > > 
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> > > > 
> > > > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > > > It looks like there is a LTO code-size regression on trunk:
> > > > (size of libxul.so, build without elfhack):
> > > > 
> > > > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
> > > 
> > > About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> > > your tree.
> > 
> > Yes, my tree already included this fix and also the fix from bug 55551.
> 
> Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some
> regressions
> on some SPEC benchmarks (such as GCC) and this helps. If it doesn't it would be
> nice to know what value is needed for comparable size.

Unfortunately it doesn't help much, because with "--param hot-bb-count-ws-permille=990" the size is only 0.25% smaller:
(With --param) : 42098856
(Without     ) : 42204584

I will try smaller values later.
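
The 0.25% figure follows directly from the two sizes. A minimal sketch (variable names are illustrative only):

```python
size_with_param = 42098856     # built with --param hot-bb-count-ws-permille=990
size_without_param = 42204584  # built with the default value

saved_bytes = size_without_param - size_with_param
saved_pct = 100.0 * saved_bytes / size_without_param

print(f"{saved_bytes} bytes smaller ({saved_pct:.2f}%)")
```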
Comment 149 Jan Hubicka 2012-12-02 15:05:52 UTC
> > Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some
> > regressions
> > on some SPEC benchmarks (such as GCC) and this helps. If it doesn't it would be
> > nice to know what value is needed for comparable size.
> 
> Unfortunately it doesn't help much, because with "--param
> hot-bb-count-ws-permille=990" the size is only 0.25% smaller:
> (With --param) : 42098856
> (Without     ) : 42204584
> 
> I will try smaller values later.

Hmm, that sounds like quite bad news - the histogram code was supposed to help
in such cases.  I will try to fix the non-PGO case and let's try to compare how
PGO/non-PGO compare first.  If you could put the -fdump-ipa-inline dump
somewhere, I will try to check if there is something obviously wrong.

In the worst case we can resort to combining both heuristics - i.e. keeping the
hot_bb_fraction in addition to the histogram code. In fact I planned to do it this
way, but Teresa removed the old code and I did not see a good reason to keep it.

Honza
Comment 150 Markus Trippelsdorf 2012-12-02 18:03:28 UTC
For comparison I've just disabled skia and built with LTO only;
the size looks good for this case: 31356968
Comment 151 Jan Hubicka 2012-12-02 20:52:13 UTC
Teresa committed another bugfix just today. So with a bit of luck it will work now?
I will try to look deeper into it ASAP, but I am just getting ready for a trip to the USA.

Honza
Comment 152 Jan Hubicka 2012-12-02 21:09:24 UTC
Also I suppose you don't have a comparison to 4.7 handy? (I am curious because of the inliner heuristic re-tuning)

Honza
Comment 153 Markus Trippelsdorf 2012-12-02 21:13:21 UTC
On 2012.12.02 at 21:09 +0000, hubicka at ucw dot cz wrote:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> 
> --- Comment #152 from Jan Hubicka <hubicka at ucw dot cz> 2012-12-02 21:09:24 UTC ---
> Also I suppose you don't have a comparison to 4.7 handy? (I am curious because of
> the inliner heuristic re-tuning)

The LTO/PGO sizes were measured with the newest patch from Teresa
already applied.

gcc-4.7 lto/pgo: size: 33337456 | Kraken bench: 2706.7ms +/- 1.1%
Comment 154 Teresa Johnson 2012-12-11 19:30:53 UTC
What was the size of the gcc lto/pgo binary before the change to use the histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a very large increase, ~25%.
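
To put a number on that increase, here is a small sketch using the libxul.so sizes from comments #144 and #153 (variable names are illustrative only):

```python
size_gcc47_lto_pgo = 33337456  # gcc-4.7 lto/pgo, comment #153
size_trunk_lto_pgo = 42204584  # trunk lto/pgo, comment #144

growth_pct = 100.0 * (size_trunk_lto_pgo - size_gcc47_lto_pgo) / size_gcc47_lto_pgo
print(f"trunk lto/pgo is {growth_pct:.1f}% larger than gcc-4.7 lto/pgo")
```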

Markus, could you attach to the bug one of the gcda files so that I can see the program summary and figure out how far off the old hot bb threshold is from the new histogram-based one? Also, it would be good to see the -fdump-ipa-inline dumps before and after the regression (if necessary, the before one could be from 4_7).
Comment 155 Markus Trippelsdorf 2012-12-11 22:57:14 UTC
(In reply to comment #154)
> What was the size of the gcc lto/pgo binary before the change to use the
> histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a
> very large increase, ~25%.

With revision 193914 (before the change) the lto/pgo size is 42115424 bytes.
So it looks like Teresa is off the hook.

> Markus, could you attach to the bug one of the gcda files so that I can see the
> program summary and figure out how far off the old hot bb threshold is from the
> new histogram-based one? Also, it would be good to see the -fdump-ipa-inline
> dumps before and after the regression (if necessary, the before one could be
> from 4_7).

Will try to post them tomorrow.
Comment 156 Teresa Johnson 2012-12-12 00:00:17 UTC
On Tue, Dec 11, 2012 at 2:57 PM, markus at trippelsdorf dot de
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
>
> --- Comment #155 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-11 22:57:14 UTC ---
> (In reply to comment #154)
>> What was the size of the gcc lto/pgo binary before the change to use the
>> histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a
>> very large increase, ~25%.
>
> With revision 193914 (before the change) the lto/pgo size is 42115424 bytes.
> So it looks like Teresa is off the hook.

Unfortunately, I am still possibly on the hook since the main suspect
change is r193747 (committed by Honza with changes made by him and me
to use the histogram instead of a hard limit for determining bb
hotness). Between then and when I committed fixes for this under LTO
(r193999) I would expect that the code size might have been worse
temporarily because everything looked hot since the histogram was not
being streamed through the LTO files properly, and so inlining could
have gotten excessive.

>
>> Markus, could you attach to the bug one of the gcda files so that I can see the
>> program summary and figure out how far off the old hot bb threshold is from the
>> new histogram-based one? Also, it would be good to see the -fdump-ipa-inline
>> dumps before and after the regression (if necessary, the before one could be
>> from 4_7).
>
> Will try to post them tomorrow .

Ok thanks.
Teresa



Comment 157 Markus Trippelsdorf 2012-12-12 11:43:27 UTC
With revision 193740 libxul's size is ~34MB, which is OK.

(Unfortunately this new ICE happens with yesterday's gcc when linking libxul:

/var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function ‘CreateRange’:
/var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409

I will open a new PR for this later.)

Here are the requested files:

(I don't know which of the ~3000 gcda files you need, so I've uploaded them all)
http://www.trippelsdorf.de/gcda_before.tar.bz2 (4MB)
http://www.trippelsdorf.de/gcda_after.tar.bz2  (4MB)

(-fdump-ipa-inline output)
http://www.trippelsdorf.de/libxul_before.inline.tar.bz2 (100MB)
http://www.trippelsdorf.de/libxul_after.inline.tar.bz2  (68MB, everything 'till the ICE hit)
Comment 158 Teresa Johnson 2012-12-12 18:59:56 UTC
On Wed, Dec 12, 2012 at 3:43 AM, markus at trippelsdorf dot de
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
>
> --- Comment #157 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-12 11:43:27 UTC ---
> With revision 193740 libxul's size is ~34MB, which is OK.
>
> (Unfortunately this new ICE happens with yesterday's gcc when linking libxul:
>
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function
> ‘CreateRange’:
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal
> compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409
>
> I will open a new PR for this later.)
>
> Here are the requested files:
>
> (I don't know which of the ~3000 gcda files you need, so I've uploaded them
> all)
> http://www.trippelsdorf.de/gcda_before.tar.bz2 (4MB)
> http://www.trippelsdorf.de/gcda_after.tar.bz2  (4MB)

Sorry, I should have clarified that any one of them would do (as long
as it corresponded to an object file included in the LTO link for the
main executable), since the info I need is in the program summary
section for the executable, which is duplicated in each of them.

>
> (-fdump-ipa-inline output)
> http://www.trippelsdorf.de/libxul_before.inline.tar.bz2 (100MB)
> http://www.trippelsdorf.de/libxul_after.inline.tar.bz2  (68MB, everything 'till
> the ICE hit)

With the old heuristics, the hot bb cutoff was:
                profile_info->sum_max / PARAM_VALUE (HOT_BB_COUNT_FRACTION)

In this case, sum_max is 103439951 and HOT_BB_COUNT_FRACTION was
10000, so the cutoff count was 10343.
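
Spelled out, and assuming the division is an integer division as it would be for integer counters in C, the old cutoff works out like this (the Python names below are illustrative, not GCC identifiers):

```python
sum_max = 103439951            # from the program summary in the gcda file
hot_bb_count_fraction = 10000  # old default of HOT_BB_COUNT_FRACTION

# Integer division, truncating like C does for non-negative integer operands.
old_cutoff = sum_max // hot_bb_count_fraction
print("old hot bb cutoff count:", old_cutoff)
```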

From the working set computed from the histogram, the 99.9% cutoff
count is 320. See the end of this email for the full set of histograms
and working sets, but here are the top few working sets:

...
hal/Hal.gcda:           96.72%: num counts=30069, min counter=16389
hal/Hal.gcda:           97.50%: num counts=35296, min counter=10241
hal/Hal.gcda:           98.28%: num counts=43669, min counter=6145
hal/Hal.gcda:           99.06%: num counts=59589, min counter=3072
hal/Hal.gcda:           99.90%: num counts=115840, min counter=320

So it looks like you would want a cutoff of 97.5% to get close to what
was there before.
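
One way to read that off the working sets, as a minimal sketch: pick the entry whose min counter is closest to the old cutoff of 10343. Only the five working sets listed above are used, and the tuple layout is my own.

```python
# (percentage of sum_all covered, num counts, min counter)
working_sets = [
    (96.72, 30069, 16389),
    (97.50, 35296, 10241),
    (98.28, 43669, 6145),
    (99.06, 59589, 3072),
    (99.90, 115840, 320),
]
old_cutoff = 10343  # sum_max / HOT_BB_COUNT_FRACTION from the old heuristic

closest = min(working_sets, key=lambda ws: abs(ws[2] - old_cutoff))
print(f"{closest[0]:.2f}% working set has min counter {closest[2]}")
```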

(Honza, I just made some changes to enable gcov-dump to optionally
compute and dump out the working sets from the histogram. I can send
this for upstream review as I have wanted this several times.)

The much smaller cutoff count is why there are fewer calls marked
unlikely and more inlining:

$ grep "call is unlikely" before/libxul.so.wpa.049i.inline  | wc
 442342 4944522 42560600

$ grep "call is unlikely" after/libxul.so.wpa.049i.inline  | wc
 392683 4349335 37477001

$ grep Inlined before/libxul.so.wpa.049i.inline  | grep eliminated
Inlined 60432 calls, eliminated 30986 functions

$ grep Inlined after/libxul.so.wpa.049i.inline  | grep eliminated
Inlined 89573 calls, eliminated 28921 functions
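
Expressed as relative changes, here is a small sketch using only the counts from the grep output above (dictionary keys are just labels):

```python
before = {"unlikely calls": 442342, "inlined calls": 60432, "eliminated functions": 30986}
after = {"unlikely calls": 392683, "inlined calls": 89573, "eliminated functions": 28921}

changes = {
    name: 100.0 * (after[name] - before[name]) / before[name]
    for name in before
}
for name, pct in changes.items():
    print(f"{name:22s} {before[name]:7d} -> {after[name]:7d}  ({pct:+.1f}%)")
```

So roughly 11% fewer calls are marked unlikely, almost 50% more calls get inlined, and about 7% fewer functions are eliminated.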

One thing that is interesting in the above info, and may be
contributing to the larger size now, is that there are more inlines,
but fewer functions are being eliminated. I'm not sure why that is
offhand. It's possible (probable) that inlining heuristics need some
retuning to make optimal use of the new cutoffs.

We also see additional inlines in some of our large internal apps with
the change, but not much increase in binary size, and it sometimes
leads to better performance - although we are not as much affected
because the google branches were using a much larger
HOT_BB_COUNT_FRACTION of 60K already, in order to get more inlining.
In this case, it looks like you are getting more inlines but it is
apparently performance-neutral?

Looking at a graph of the working set data, the number of counters
starts increasing super-exponentially as the percentages approach
100%. I've been thinking that it may be useful to find the "knee" of
the curve to determine the appropriate cutoff percentage. I'll see if
I can make some progress on that.
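
To illustrate what finding the "knee" could look like, here is a rough sketch of one common heuristic: scale both axes to [0, 1] and take the point that lies farthest below the straight line joining the first and last points. This is only an assumption about a possible approach, not something taken from GCC, and with just the five working sets listed earlier the result is far too coarse to be meaningful.

```python
# (percentage of sum_all covered, num counts) from the working sets above.
points = [
    (96.72, 30069),
    (97.50, 35296),
    (98.28, 43669),
    (99.06, 59589),
    (99.90, 115840),
]


def find_knee(curve):
    """Return the point farthest below the chord of a convex, increasing curve."""
    (x0, y0), (x1, y1) = curve[0], curve[-1]

    def gap_below_chord(point):
        x = (point[0] - x0) / (x1 - x0)  # scaled to [0, 1]
        y = (point[1] - y0) / (y1 - y0)  # scaled to [0, 1]
        return x - y  # proportional to the distance from the line y = x

    return max(curve, key=gap_below_chord)


knee = find_knee(points)
print(f"knee at the {knee[0]:.2f}% working set ({knee[1]} counters)")
```

A real implementation would presumably run over the full histogram rather than a handful of precomputed percentages.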

Full histogram/working set data:

hal/Hal.gcda: a3000000: 512:PROGRAM_SUMMARY checksum=0x3aa34521
hal/Hal.gcda: counts=2109045, runs=7, sum_all=9749748271,
run_max=97136704, sum_max=103439951
hal/Hal.gcda: counter histogram:
hal/Hal.gcda: 0: num counts=1824318, min counter=0, cum_counter=0
hal/Hal.gcda: 1: num counts=30727, min counter=1, cum_counter=30727
hal/Hal.gcda: 2: num counts=11646, min counter=2, cum_counter=23292
hal/Hal.gcda: 3: num counts=5414, min counter=3, cum_counter=16242
hal/Hal.gcda: 4: num counts=5156, min counter=4, cum_counter=20624
hal/Hal.gcda: 5: num counts=3379, min counter=5, cum_counter=16895
hal/Hal.gcda: 6: num counts=3674, min counter=6, cum_counter=22044
hal/Hal.gcda: 7: num counts=2310, min counter=7, cum_counter=16170
hal/Hal.gcda: 8: num counts=4756, min counter=8, cum_counter=40330
hal/Hal.gcda: 9: num counts=4725, min counter=10, cum_counter=49265
hal/Hal.gcda: 10: num counts=4256, min counter=12, cum_counter=52450
hal/Hal.gcda: 11: num counts=3424, min counter=14, cum_counter=49760
hal/Hal.gcda: 12: num counts=4936, min counter=16, cum_counter=86713
hal/Hal.gcda: 13: num counts=4025, min counter=20, cum_counter=86217
hal/Hal.gcda: 14: num counts=5271, min counter=24, cum_counter=134994
hal/Hal.gcda: 15: num counts=3052, min counter=28, cum_counter=89797
hal/Hal.gcda: 16: num counts=6812, min counter=32, cum_counter=241575
hal/Hal.gcda: 17: num counts=6269, min counter=40, cum_counter=274778
hal/Hal.gcda: 18: num counts=5652, min counter=48, cum_counter=289677
hal/Hal.gcda: 19: num counts=4240, min counter=56, cum_counter=253391
hal/Hal.gcda: 20: num counts=8321, min counter=64, cum_counter=592920
hal/Hal.gcda: 21: num counts=5824, min counter=80, cum_counter=508559
hal/Hal.gcda: 22: num counts=4846, min counter=96, cum_counter=497364
hal/Hal.gcda: 23: num counts=4014, min counter=112, cum_counter=478449
hal/Hal.gcda: 24: num counts=6460, min counter=128, cum_counter=919926
hal/Hal.gcda: 25: num counts=5253, min counter=160, cum_counter=916231
hal/Hal.gcda: 26: num counts=4072, min counter=192, cum_counter=844827
hal/Hal.gcda: 27: num counts=3544, min counter=224, cum_counter=850637
hal/Hal.gcda: 28: num counts=6143, min counter=256, cum_counter=1750280
hal/Hal.gcda: 29: num counts=4690, min counter=320, cum_counter=1648174
hal/Hal.gcda: 30: num counts=3864, min counter=384, cum_counter=1614077
hal/Hal.gcda: 31: num counts=3377, min counter=448, cum_counter=1616477
hal/Hal.gcda: 32: num counts=5986, min counter=512, cum_counter=3426093
hal/Hal.gcda: 33: num counts=4449, min counter=640, cum_counter=3100174
hal/Hal.gcda: 34: num counts=5339, min counter=768, cum_counter=4479538
hal/Hal.gcda: 35: num counts=3402, min counter=896, cum_counter=3264788
hal/Hal.gcda: 36: num counts=6139, min counter=1024, cum_counter=7017454
hal/Hal.gcda: 37: num counts=4224, min counter=1280, cum_counter=5931630
hal/Hal.gcda: 38: num counts=3957, min counter=1536, cum_counter=6576291
hal/Hal.gcda: 39: num counts=2747, min counter=1792, cum_counter=5236457
hal/Hal.gcda: 40: num counts=4640, min counter=2048, cum_counter=10611270
hal/Hal.gcda: 41: num counts=3733, min counter=2560, cum_counter=10510163
hal/Hal.gcda: 42: num counts=3079, min counter=3072, cum_counter=10242287
hal/Hal.gcda: 43: num counts=2651, min counter=3584, cum_counter=10140728
hal/Hal.gcda: 44: num counts=4434, min counter=4096, cum_counter=20361262
hal/Hal.gcda: 45: num counts=3987, min counter=5121, cum_counter=22720940
hal/Hal.gcda: 46: num counts=2943, min counter=6145, cum_counter=19504640
hal/Hal.gcda: 47: num counts=2334, min counter=7169, cum_counter=17826112
hal/Hal.gcda: 48: num counts=2817, min counter=8193, cum_counter=25598488
hal/Hal.gcda: 49: num counts=2779, min counter=10241, cum_counter=31417188
hal/Hal.gcda: 50: num counts=3033, min counter=12290, cum_counter=40410833
hal/Hal.gcda: 51: num counts=1853, min counter=14340, cum_counter=28478565
hal/Hal.gcda: 52: num counts=2655, min counter=16389, cum_counter=48690364
hal/Hal.gcda: 53: num counts=2445, min counter=20488, cum_counter=55375590
hal/Hal.gcda: 54: num counts=1691, min counter=24592, cum_counter=44944827
hal/Hal.gcda: 55: num counts=1436, min counter=28719, cum_counter=44036063
hal/Hal.gcda: 56: num counts=2533, min counter=32825, cum_counter=92560194
hal/Hal.gcda: 57: num counts=1974, min counter=41047, cum_counter=88298216
hal/Hal.gcda: 58: num counts=1635, min counter=49329, cum_counter=86653692
hal/Hal.gcda: 59: num counts=1131, min counter=57610, cum_counter=69796538
hal/Hal.gcda: 60: num counts=1638, min counter=65856, cum_counter=120165995
hal/Hal.gcda: 61: num counts=1227, min counter=82393, cum_counter=110414350
hal/Hal.gcda: 62: num counts=1420, min counter=98946, cum_counter=152171465
hal/Hal.gcda: 63: num counts=726, min counter=115741, cum_counter=89865259
hal/Hal.gcda: 64: num counts=1249, min counter=132608, cum_counter=184646974
hal/Hal.gcda: 65: num counts=862, min counter=165900, cum_counter=156618404
hal/Hal.gcda: 66: num counts=930, min counter=198695, cum_counter=199922412
hal/Hal.gcda: 67: num counts=628, min counter=232660, cum_counter=156498665
hal/Hal.gcda: 68: num counts=1136, min counter=266317, cum_counter=338816591
hal/Hal.gcda: 69: num counts=736, min counter=333978, cum_counter=267217317
hal/Hal.gcda: 70: num counts=589, min counter=401495, cum_counter=256810939
hal/Hal.gcda: 71: num counts=431, min counter=469085, cum_counter=216371731
hal/Hal.gcda: 72: num counts=581, min counter=536827, cum_counter=351453204
hal/Hal.gcda: 73: num counts=387, min counter=672090, cum_counter=287503062
hal/Hal.gcda: 74: num counts=345, min counter=811897, cum_counter=302673649
hal/Hal.gcda: 75: num counts=246, min counter=951474, cum_counter=250577118
hal/Hal.gcda: 76: num counts=315, min counter=1084378, cum_counter=382079125
hal/Hal.gcda: 77: num counts=224, min counter=1362634, cum_counter=336536846
hal/Hal.gcda: 78: num counts=142, min counter=1643302, cum_counter=252854048
hal/Hal.gcda: 79: num counts=104, min counter=1925957, cum_counter=215119385
hal/Hal.gcda: 80: num counts=131, min counter=2211770, cum_counter=321748834
hal/Hal.gcda: 81: num counts=123, min counter=2739896, cum_counter=373169753
hal/Hal.gcda: 82: num counts=72, min counter=3277758, cum_counter=253778382
hal/Hal.gcda: 83: num counts=38, min counter=3853957, cum_counter=158229587
hal/Hal.gcda: 84: num counts=59, min counter=4384565, cum_counter=282974111
hal/Hal.gcda: 85: num counts=56, min counter=5467360, cum_counter=340377441
hal/Hal.gcda: 86: num counts=37, min counter=6569721, cum_counter=254677959
hal/Hal.gcda: 87: num counts=17, min counter=7670909, cum_counter=138198211
hal/Hal.gcda: 88: num counts=31, min counter=8797370, cum_counter=300444212
hal/Hal.gcda: 89: num counts=9, min counter=11064352, cum_counter=104597973
hal/Hal.gcda: 90: num counts=5, min counter=13196116, cum_counter=68483280
hal/Hal.gcda: 91: num counts=25, min counter=15471823, cum_counter=405406333
hal/Hal.gcda: 92: num counts=39, min counter=17739191, cum_counter=769153481
hal/Hal.gcda: 93: num counts=1, min counter=23220597, cum_counter=23248710
hal/Hal.gcda: 94: num counts=1, min counter=26834310, cum_counter=26862423
hal/Hal.gcda: 95: num counts=5, min counter=31885437, cum_counter=169003071
hal/Hal.gcda: 96: num counts=1, min counter=33576018, cum_counter=34881284
hal/Hal.gcda: 99: num counts=1, min counter=60798823, cum_counter=60799245
hal/Hal.gcda: 102: num counts=2, min counter=100714244, cum_counter=204154195
hal/Hal.gcda: counter working sets:
hal/Hal.gcda: 0.78%: num counts=1, min counter=100714244
hal/Hal.gcda: 1.56%: num counts=2, min counter=100714244
hal/Hal.gcda: 2.34%: num counts=3, min counter=60798823
hal/Hal.gcda: 3.12%: num counts=5, min counter=31885437
hal/Hal.gcda: 3.90%: num counts=7, min counter=31885437
hal/Hal.gcda: 4.68%: num counts=9, min counter=31885437
hal/Hal.gcda: 5.46%: num counts=12, min counter=17739191
hal/Hal.gcda: 6.24%: num counts=17, min counter=17739191
hal/Hal.gcda: 7.02%: num counts=21, min counter=17739191
hal/Hal.gcda: 7.80%: num counts=25, min counter=17739191
hal/Hal.gcda: 8.58%: num counts=29, min counter=17739191
hal/Hal.gcda: 9.36%: num counts=34, min counter=17739191
hal/Hal.gcda: 10.14%: num counts=38, min counter=17739191
hal/Hal.gcda: 10.92%: num counts=42, min counter=17739191
hal/Hal.gcda: 11.70%: num counts=47, min counter=17739191
hal/Hal.gcda: 12.48%: num counts=50, min counter=17739191
hal/Hal.gcda: 13.26%: num counts=51, min counter=15471823
hal/Hal.gcda: 14.04%: num counts=56, min counter=15471823
hal/Hal.gcda: 14.82%: num counts=61, min counter=15471823
hal/Hal.gcda: 15.60%: num counts=66, min counter=15471823
hal/Hal.gcda: 16.38%: num counts=71, min counter=15471823
hal/Hal.gcda: 17.16%: num counts=75, min counter=15471823
hal/Hal.gcda: 17.94%: num counts=80, min counter=13196116
hal/Hal.gcda: 18.72%: num counts=86, min counter=11064352
hal/Hal.gcda: 19.50%: num counts=94, min counter=8797370
hal/Hal.gcda: 20.28%: num counts=102, min counter=8797370
hal/Hal.gcda: 21.06%: num counts=111, min counter=8797370
hal/Hal.gcda: 21.84%: num counts=120, min counter=8797370
hal/Hal.gcda: 22.62%: num counts=126, min counter=7670909
hal/Hal.gcda: 23.40%: num counts=136, min counter=7670909
hal/Hal.gcda: 24.18%: num counts=146, min counter=6569721
hal/Hal.gcda: 24.96%: num counts=158, min counter=6569721
hal/Hal.gcda: 25.74%: num counts=169, min counter=6569721
hal/Hal.gcda: 26.52%: num counts=180, min counter=5467360
hal/Hal.gcda: 27.30%: num counts=194, min counter=5467360
hal/Hal.gcda: 28.08%: num counts=208, min counter=5467360
hal/Hal.gcda: 28.86%: num counts=222, min counter=5467360
hal/Hal.gcda: 29.64%: num counts=230, min counter=5467360
hal/Hal.gcda: 30.42%: num counts=247, min counter=4384565
hal/Hal.gcda: 31.20%: num counts=264, min counter=4384565
hal/Hal.gcda: 31.98%: num counts=281, min counter=4384565
hal/Hal.gcda: 32.76%: num counts=294, min counter=3853957
hal/Hal.gcda: 33.54%: num counts=313, min counter=3853957
hal/Hal.gcda: 34.32%: num counts=331, min counter=3277758
hal/Hal.gcda: 35.10%: num counts=354, min counter=3277758
hal/Hal.gcda: 35.88%: num counts=377, min counter=3277758
hal/Hal.gcda: 36.66%: num counts=399, min counter=3277758
hal/Hal.gcda: 37.44%: num counts=422, min counter=2739896
hal/Hal.gcda: 38.22%: num counts=450, min counter=2739896
hal/Hal.gcda: 39.00%: num counts=477, min counter=2739896
hal/Hal.gcda: 39.78%: num counts=505, min counter=2739896
hal/Hal.gcda: 40.56%: num counts=522, min counter=2739896
hal/Hal.gcda: 41.34%: num counts=554, min counter=2211770
hal/Hal.gcda: 42.12%: num counts=588, min counter=2211770
hal/Hal.gcda: 42.90%: num counts=622, min counter=2211770
hal/Hal.gcda: 43.68%: num counts=653, min counter=2211770
hal/Hal.gcda: 44.46%: num counts=680, min counter=1925957
hal/Hal.gcda: 45.24%: num counts=720, min counter=1925957
hal/Hal.gcda: 46.02%: num counts=757, min counter=1925957
hal/Hal.gcda: 46.80%: num counts=797, min counter=1643302
hal/Hal.gcda: 47.58%: num counts=843, min counter=1643302
hal/Hal.gcda: 48.36%: num counts=890, min counter=1643302
hal/Hal.gcda: 49.14%: num counts=929, min counter=1362634
hal/Hal.gcda: 49.92%: num counts=985, min counter=1362634
hal/Hal.gcda: 50.70%: num counts=1041, min counter=1362634
hal/Hal.gcda: 51.48%: num counts=1097, min counter=1362634
hal/Hal.gcda: 52.26%: num counts=1132, min counter=1084378
hal/Hal.gcda: 53.04%: num counts=1202, min counter=1084378
hal/Hal.gcda: 53.82%: num counts=1272, min counter=1084378
hal/Hal.gcda: 54.60%: num counts=1342, min counter=1084378
hal/Hal.gcda: 55.38%: num counts=1412, min counter=1084378
hal/Hal.gcda: 56.16%: num counts=1446, min counter=951474
hal/Hal.gcda: 56.94%: num counts=1526, min counter=951474
hal/Hal.gcda: 57.72%: num counts=1606, min counter=951474
hal/Hal.gcda: 58.50%: num counts=1684, min counter=951474
hal/Hal.gcda: 59.28%: num counts=1760, min counter=811897
hal/Hal.gcda: 60.06%: num counts=1854, min counter=811897
hal/Hal.gcda: 60.84%: num counts=1948, min counter=811897
hal/Hal.gcda: 61.62%: num counts=2029, min counter=811897
hal/Hal.gcda: 62.40%: num counts=2124, min counter=672090
hal/Hal.gcda: 63.18%: num counts=2237, min counter=672090
hal/Hal.gcda: 63.96%: num counts=2351, min counter=672090
hal/Hal.gcda: 64.74%: num counts=2425, min counter=536827
hal/Hal.gcda: 65.52%: num counts=2567, min counter=536827
hal/Hal.gcda: 66.30%: num counts=2709, min counter=536827
hal/Hal.gcda: 67.08%: num counts=2851, min counter=536827
hal/Hal.gcda: 67.86%: num counts=2993, min counter=536827
hal/Hal.gcda: 68.64%: num counts=3070, min counter=469085
hal/Hal.gcda: 69.42%: num counts=3232, min counter=469085
hal/Hal.gcda: 70.20%: num counts=3395, min counter=469085
hal/Hal.gcda: 70.98%: num counts=3543, min counter=401495
hal/Hal.gcda: 71.76%: num counts=3733, min counter=401495
hal/Hal.gcda: 72.54%: num counts=3923, min counter=401495
hal/Hal.gcda: 73.32%: num counts=4071, min counter=333978
hal/Hal.gcda: 74.10%: num counts=4299, min counter=333978
hal/Hal.gcda: 74.88%: num counts=4527, min counter=333978
hal/Hal.gcda: 75.66%: num counts=4753, min counter=333978
hal/Hal.gcda: 76.44%: num counts=4961, min counter=266317
hal/Hal.gcda: 77.22%: num counts=5247, min counter=266317
hal/Hal.gcda: 78.00%: num counts=5533, min counter=266317
hal/Hal.gcda: 78.78%: num counts=5819, min counter=266317
hal/Hal.gcda: 79.56%: num counts=5980, min counter=232660
hal/Hal.gcda: 80.34%: num counts=6308, min counter=232660
hal/Hal.gcda: 81.12%: num counts=6603, min counter=198695
hal/Hal.gcda: 81.90%: num counts=6986, min counter=198695
hal/Hal.gcda: 82.68%: num counts=7370, min counter=198695
hal/Hal.gcda: 83.46%: num counts=7722, min counter=165900
hal/Hal.gcda: 84.24%: num counts=8181, min counter=165900
hal/Hal.gcda: 85.02%: num counts=8621, min counter=132608
hal/Hal.gcda: 85.80%: num counts=9195, min counter=132608
hal/Hal.gcda: 86.58%: num counts=9636, min counter=115741
hal/Hal.gcda: 87.36%: num counts=10284, min counter=115741
hal/Hal.gcda: 88.14%: num counts=11007, min counter=98946
hal/Hal.gcda: 88.92%: num counts=11704, min counter=98946
hal/Hal.gcda: 89.70%: num counts=12574, min counter=82393
hal/Hal.gcda: 90.48%: num counts=13499, min counter=65856
hal/Hal.gcda: 91.26%: num counts=14569, min counter=65856
hal/Hal.gcda: 92.04%: num counts=15700, min counter=57610
hal/Hal.gcda: 92.82%: num counts=17240, min counter=49329
hal/Hal.gcda: 93.60%: num counts=18930, min counter=41047
hal/Hal.gcda: 94.38%: num counts=20933, min counter=32825
hal/Hal.gcda: 95.16%: num counts=23128, min counter=28719
hal/Hal.gcda: 95.94%: num counts=26146, min counter=20488
hal/Hal.gcda: 96.72%: num counts=30069, min counter=16389
hal/Hal.gcda: 97.50%: num counts=35296, min counter=10241
hal/Hal.gcda: 98.28%: num counts=43669, min counter=6145
hal/Hal.gcda: 99.06%: num counts=59589, min counter=3072
hal/Hal.gcda: 99.90%: num counts=115840, min counter=320


Teresa

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Comment 159 Jan Hubicka 2012-12-12 20:35:37 UTC
> hal/Hal.gcda:           96.72%: num counts=30069, min counter=16389
> hal/Hal.gcda:           97.50%: num counts=35296, min counter=10241
> hal/Hal.gcda:           98.28%: num counts=43669, min counter=6145
> hal/Hal.gcda:           99.06%: num counts=59589, min counter=3072
> hal/Hal.gcda:           99.90%: num counts=115840, min counter=320
> 
> So it looks like you would want a cutoff of 97.5% to get close to what
> was there before.

Setting the default cutoff to something like 95% sounds fine to me.  I
see I asked to reduce the parameter but suggested 990. Markus, can you
try setting HOT_BB_COUNT_WS_PERMILLE to 950?

Honza
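
For illustration, the cutoff discussion above can be read as a lookup in the "counter working sets" table: pick the first working set that covers at least the requested share of the profile and treat its min counter as the hotness threshold. The following is only a sketch under that assumption; the helper name and the plain linear scan are hypothetical and are not GCC's actual implementation. The rows are copied from the tail of the hal/Hal.gcda dump.

```cpp
#include <cstdint>
#include <vector>

// One row of the "counter working sets" dump: cumulative share of the
// total profile count (in hundredths of a percent) and the smallest
// counter that still belongs to that working set.
struct WorkingSetRow {
  int share_x100;
  std::int64_t min_counter;
};

// Tail of the hal/Hal.gcda dump quoted in this thread.
static const std::vector<WorkingSetRow> kRows = {
  {9438, 32825}, {9516, 28719}, {9594, 20488}, {9672, 16389},
  {9750, 10241}, {9828, 6145},  {9906, 3072},  {9990, 320},
};

// Hypothetical helper (not a GCC function): return the min counter of the
// first working set covering at least `permille` of the profile, or -1 if
// no listed row reaches that share.
std::int64_t min_counter_for_cutoff(int permille) {
  for (const WorkingSetRow &row : kRows)
    if (row.share_x100 >= permille * 10)
      return row.min_counter;
  return -1;
}
```

Under this reading, a cutoff of 975 selects the 97.50% row and 950 selects the 95.16% row, so lowering the permille value raises the counter a block must reach to count as hot.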
Comment 160 Markus Trippelsdorf 2012-12-13 09:52:37 UTC
(In reply to comment #159)
> > hal/Hal.gcda:           96.72%: num counts=30069, min counter=16389
> > hal/Hal.gcda:           97.50%: num counts=35296, min counter=10241
> > hal/Hal.gcda:           98.28%: num counts=43669, min counter=6145
> > hal/Hal.gcda:           99.06%: num counts=59589, min counter=3072
> > hal/Hal.gcda:           99.90%: num counts=115840, min counter=320
> > 
> > So it looks like you would want a cutoff of 97.5% to get close to what
> > was there before.
> 
> Setting the default cutoff to something like 95% would sound fine to me.  I
> see i asked to reduce the parameter but suggested 990. Markus, can you
> try setting HOT_BB_COUNT_WS_PERMILLE to 950?

It doesn't help:

 HOT_BB_COUNT_WS_PERMILLE=950: size of libxul.so: 42149632 bytes

(In reply to comment #157)
> (Unfortunately this new ICE happens with yesterdays gcc when linking libxul:
> 
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function
> ‘CreateRange’:
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal
> compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409
> 
> I will open a new PR for this later.)

See PR55669
Comment 161 Markus Trippelsdorf 2012-12-13 12:59:59 UTC
I've opened a new bug for the binary size increase issue: 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674
Comment 162 Markus Trippelsdorf 2012-12-13 22:25:27 UTC
The libxul binary size issue is solved now.

During testing I came across another issue that looks similar 
to the one in comment 146:
/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccwu5G98.ltrans4.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN13nsXULDocument14MaybeBroadcastEv.429466' which may overflow at runtime; recompile with -fPIC
/tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
/tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
collect2: error: ld returned 1 exit status

After I deleted both nsXULDocument.o and nsXULDocument.gcda and rebuilt with:
 make -f client.mk realbuild MOZ_PROFILE_USE=1 
the problem went away.
Comment 163 Jan Hubicka 2012-12-14 18:24:31 UTC
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
> 
> --- Comment #162 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 22:25:27 UTC ---
> The libxul binary size issue is solved now.

Good
> 
> During testing I came across another issue that looks similar 
> to the one Comment 146:
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> error: /tmp/ccwu5G98.ltrans4.ltrans.o: requires dynamic R_X86_64_PC32 reloc
> against '_ZN13nsXUL
> Document14MaybeBroadcastEv.429466' which may overflow at runtime; recompile
> with -fPIC
> /tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function
> nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type*
> NS_NewRunnableMethod<nsXULDocument*, void (nsXU
> LDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone
> .local.39398] [clone .constprop.84952]: error: undefined reference to
> 'nsXULDocument::MaybeBroadcast() [clone .429466]'
> /tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function
> nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type*
> NS_NewRunnableMethod<nsXULDocument*, void (nsXU
> LDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone
> .local.39398] [clone .constprop.84952]: error: undefined reference to
> 'nsXULDocument::MaybeBroadcast() [clone .429466]'
> collect2: error: ld returned 1 exit status
> 
> After I deleted both nsXULDocument.o and nsXULDocument.gcda and rebuild with:
>  make -f client.mk realbuild MOZ_PROFILE_USE=1 
> the problem did go away.

This sounds like an independent problem with partitioning.  I am travelling until the 17th, so I will
try to check this locally myself.

Perhaps you can give details on your setup? (My Mozilla tree has become quite dirty with various local
hacks I made over time; perhaps I should refresh to a cleaner state.)

Honza
Comment 164 Leo Yuriev 2013-01-06 00:31:55 UTC
Some trouble while building LLVM with -flto.

../x86_64-linux-gnu/bin/ld.gold: error: /tmp/cc60XH2F.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against 'X86CompilationCallback2' which may overflow at runtime; recompile with -fPIC

Code:

extern "C" {
  void X86CompilationCallback(void);
  asm(
    ".text\n"
    ".align 8\n"
    ".globl " ASMPREFIX "X86CompilationCallback\n"
    TYPE_FUNCTION(X86CompilationCallback)
  ASMPREFIX "X86CompilationCallback:\n"
...
    "movq    8(%rbp), %rdx\n"
    "call    " ASMPREFIX "X86CompilationCallback2\n"
    "addq    $32, %rsp\n"
...
  );
}

void __attribute__((used))
X86CompilationCallback2(intptr_t *StackPtr, intptr_t RetAddr) {
  intptr_t *RetAddrLoc = &StackPtr[1];
...
}

}
Comment 165 Jan Hubicka 2013-01-09 15:16:26 UTC
OK, I tracked down the undefined reference to
error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc
against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at
runtime; recompile with -fPIC

It is caused by a bug in Mozilla: it includes a file defining a virtual function that uses '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' (in SkPaint), but it never links with the implementation.
Normally the function is optimized out.  Here it is not, because we never optimize out virtual functions prior to inlining (they are kept for devirtualization), and in the WPA path we forget to remove them when done.

Fixed by the following patch
Index: ipa-inline.c
===================================================================
--- ipa-inline.c        (revision 194916)
+++ ipa-inline.c        (working copy)
@@ -1793,7 +1793,7 @@
     }
 
   inline_small_functions ();
-  symtab_remove_unreachable_nodes (true, dump_file);
+  symtab_remove_unreachable_nodes (false, dump_file);
   free (order);
 
   /* Inline functions with a property that after inlining into all callers the
Index: lto/lto.c
===================================================================
--- lto/lto.c   (revision 194916)
+++ lto/lto.c   (working copy)
@@ -3215,6 +3215,7 @@
   cgraph_state = CGRAPH_STATE_IPA_SSA;
 
   execute_ipa_pass_list (all_regular_ipa_passes);
+  symtab_remove_unreachable_nodes (false, dump_file);
 
   if (cgraph_dump_file)
     {
Index: cgraphclones.c
===================================================================
--- cgraphclones.c      (revision 194916)
+++ cgraphclones.c      (working copy)
@@ -184,6 +184,7 @@
   new_node->symbol.decl = decl;
   symtab_register_node ((symtab_node)new_node);
   new_node->origin = n->origin;
+  new_node->symbol.lto_file_data = n->symbol.lto_file_data;
   if (new_node->origin)
     {
       new_node->next_nested = new_node->origin->nested;
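
As a rough illustration of why the extra removal pass matters, here is a toy reachability sweep over a call graph. It is only a sketch: the graph type, the function name and the symbol names are hypothetical and say nothing about how symtab_remove_unreachable_nodes is actually implemented. The point is that an unused function which is kept around still drags its (possibly undefined) callees into the link, while a pruned one does not.

```cpp
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Toy call graph: defined symbol name -> names it references.
using Graph = std::unordered_map<std::string, std::vector<std::string>>;

// Keep only nodes reachable from `roots`; return the names that were removed.
// References to symbols with no definition in `graph` are simply skipped.
std::set<std::string> remove_unreachable(Graph &graph,
                                         const std::vector<std::string> &roots) {
  std::set<std::string> reachable;
  std::vector<std::string> worklist(roots.begin(), roots.end());
  while (!worklist.empty()) {
    std::string name = worklist.back();
    worklist.pop_back();
    // Skip undefined symbols and nodes that were already visited.
    if (!graph.count(name) || !reachable.insert(name).second)
      continue;
    for (const std::string &callee : graph[name])
      worklist.push_back(callee);
  }
  std::set<std::string> removed;
  for (auto it = graph.begin(); it != graph.end();) {
    if (reachable.count(it->first)) {
      ++it;
    } else {
      removed.insert(it->first);
      it = graph.erase(it);
    }
  }
  return removed;
}
```

In this toy model, a never-called virtual function that references a constructor without a definition disappears together with its dangling reference once the sweep runs.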
Comment 166 Jan Hubicka 2013-01-09 15:19:41 UTC
Markus, the appearance of the undefined references I fixed with the patch above is highly sensitive to partitioning and inlining decisions.  Can you please check whether the problem with PGO remains?  It may be another instance of the same issue.
Comment 167 Markus Trippelsdorf 2013-01-09 19:58:33 UTC
(In reply to comment #166)
> Markus, the appearance of the undefined references I fixed with the patch above is highly
> sensitive to partitioning and inlining decisions.  Can you please check whether the
> problem with PGO remains?  It may be another instance of the same issue.

Just checked it using your patch from comment 165, but the issue from
comment 162 is still there:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccACx905.ltrans6.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN13nsXULDocument14MaybeBroadcastEv.466048' which may overflow at runtime; recompile with -fPIC
/tmp/ccACx905.ltrans6.ltrans.o:ccACx905.ltrans6.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_N
ewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.42120] [clone .constprop.89117]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .466048]'
/tmp/ccACx905.ltrans6.ltrans.o:ccACx905.ltrans6.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_N
ewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.42120] [clone 
.constprop.89117]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .466048]'

Also the memory usage went through the roof (not sure if this is caused
by your patch or my recent git-pull of mozilla-central): 
over 9GB RAM is needed (not much fun on my 8GB test-machine).

(So I will stop testing Firefox for now, until LTO/PGO memory usage
gets sane again (hopefully for 4.9).)
Comment 168 Jan Hubicka 2013-01-09 21:20:46 UTC
Too bad :( 
The patch should reduce memory usage, not increase it.  So it must be something else.  

My build was around 7GB w/o PGO, I will need to try the PGO builds myself.
My tree is however somewhat out of date. I will try fresh checkout and post mem usage stats.

Perhaps you can share somewhere the -lm.res and *wpa*cgraph dump of a --save-temps -fdump-ipa-cgraph build?  I will try to figure out those symbols.
Comment 169 Jan Hubicka 2013-01-09 21:22:33 UTC
Author: hubicka
Date: Wed Jan  9 21:22:26 2013
New Revision: 195066

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195066
Log:

	PR lto/45375
	* ipa-inline.c (ipa_inline): Remove extern inlines and virtual functions.
	* cgraphclones.c (cgraph_clone_node): Copy also LTO file data.
	* lto.c (do_whole_program_analysis): Remove unreachable nodes after IPA.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cgraphclones.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/lto/ChangeLog
    trunk/gcc/lto/lto.c
Comment 170 Jan Hubicka 2013-01-10 15:04:10 UTC
OK, here is updated memory use:
cgraph.c:863 (cgraph_allocate_init_indirect_info    5905200: 0.1%          0: 0.0%    6020160: 0.1%          0: 0.0%     298134
tree.c:1237 (build_int_cst_wide)                   15554272: 0.4%          0: 0.0%     782528: 0.0%          0: 0.0%     510525
tree.c:1559 (build_string)                         10685931: 0.2%          0: 0.0%   16715642: 0.4%    2193469: 1.7%     563828
stringpool.c:75 (alloc_node)                              0: 0.0%          0: 0.0%   30574880: 0.7%          0: 0.0%     764372
lto/lto.c:2286 (create_subid_section_table)         1522184: 0.0%          0: 0.0%   39117064: 0.8%    8051472: 6.4%       3978
stringpool.c:58 (stringpool_ggc_alloc)                    0: 0.0%          0: 0.0%   41092405: 0.9%    2954893: 2.4%     764372
gimple.c:3167 (iterative_hash_canonical_type)      45040752: 1.0%          0: 0.0%          0: 0.0%          0: 0.0%    2815047
lto/lto.c:1222 (iterative_hash_gimple_type)        68276864: 1.6%          0: 0.0%          0: 0.0%          0: 0.0%    4267304
ggc-common.c:249 (ggc_cleared_alloc_ptr_array_tw      91784: 0.0%  487289424:48.8%   71432600: 1.5%     248976: 0.2%      10974
lto/lto.c:1266 (iterative_hash_gimple_type)        75288576: 1.8%          0: 0.0%          0: 0.0%          0: 0.0%    4705536
lto-section-in.c:362 (lto_new_in_decl_state)         694320: 0.0%          0: 0.0%   94861800: 2.0%          0: 0.0%     796301
tree.c:1263 (build_int_cst_wide)                   76232736: 1.8%          0: 0.0%   19358880: 0.4%          0: 0.0%    2987238
cgraph.c:794 (cgraph_create_edge_1)                       0: 0.0%          0: 0.0%  125510632: 2.7%          0: 0.0%    1206833
vec.h:565 ((null))                                 66034564: 1.5%      98716: 0.0%   68500548: 1.5%    3484420: 2.8%     597783
vec.h:695 ((null))                                124654648: 2.9%  122044288:12.2%   63749232: 1.4%    2614800: 2.1%    1590429
tree-streamer-in.c:562 (streamer_alloc_tree)      125829312: 2.9%          0: 0.0%   74222904: 1.6%       7072: 0.0%    2005091
lto/lto.c:267 (lto_read_in_decl_state)              1478720: 0.0%          0: 0.0%  216390688: 4.7%   38247784:30.5%    5574107
vec.h:747 ((null))                                173791988: 4.0%   19565412: 2.0%   68225644: 1.5%    2680332: 2.1%    1396070
vec.h:707 ((null))                                133872480: 3.1%          0: 0.0%  285212728: 6.1%     800360: 0.6%    1059913
cgraph.c:500 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  472831880:10.2%          0: 0.0%    1597405
tree.c:1223 (build_int_cst_wide)                  607138944:14.1%          0: 0.0%   10427664: 0.2%    4719336: 3.8%     315034
toplev.c:959 (realloc_for_line_map)                       0: 0.0%  358037664:35.8% 1073872920:23.1%        184: 0.0%         16
tree-streamer-in.c:573 (streamer_alloc_tree)     2762184192:64.2%          0: 0.0% 1861017624:40.0%   59027616:47.1%   34649937
Total                                            4302007795        999178184       4651003487        125411458         68828967
source location                                     Garbage            Freed             Leak         Overhead            Times
-------------------------------------------------------


Actually it is a bit of an improvement over my past report.  Some obvious things:
1) trees still take up too much memory (40%).  The per-tree stats are:
decls                17310018 -1609736744
types                8983387 1509209016
exprs                2427302   80045744
constants            4079292  135393547
binfos               2005091  200038072
random kinds         5691481  227659664

and counts:
tree_list            5691475       
pointer_type         2337585
record_type          3702066       
function_decl        1856282
field_decl           2812564
const_decl           2739702
parm_decl            3549707
type_decl            4780459
result_decl          1144482
tree_binfo           2005091

2) new linemaps are still a disaster
3) VEC rewrite did break stats.

Honza
Comment 171 Jan Hubicka 2013-01-16 17:25:04 UTC
Created attachment 29182 [details]
Patch to compress line info

This patch removes column information from LTO (so we lose caret diagnostics in warning/error output at LTO time, which seems a reasonable thing to do) and avoids entering duplicate locators into the linemap.  The patch reduces linemap usage from 23% to 5% of GGC memory, saving 1-2GB on Mozilla (and also reducing LTO file size).
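
The idea of not entering duplicate locators can be sketched with a small interning table: once columns are dropped, many statements map to the same (file, line) pair and can share one entry. This is a minimal illustration only; the class and method names are hypothetical and do not correspond to libcpp's line-map API or to the attached patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a line table in which every entry costs memory.
class LocationTable {
 public:
  // Return an id for (file, line), reusing an earlier entry when one exists.
  std::uint32_t intern(const std::string &file, int line) {
    std::pair<std::string, int> key(file, line);
    auto it = cache_.find(key);
    if (it != cache_.end())
      return it->second;  // duplicate locator: no new entry is created
    std::uint32_t id = static_cast<std::uint32_t>(entries_.size());
    entries_.push_back(key);
    cache_.emplace(key, id);
    return id;
  }

  // Number of distinct entries actually stored.
  std::size_t size() const { return entries_.size(); }

 private:
  std::vector<std::pair<std::string, int>> entries_;
  std::map<std::pair<std::string, int>, std::uint32_t> cache_;
};
```

A single table shared across all input files lets headers included from many translation units collapse to one entry per line, which is presumably where most of the sharing comes from.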
Comment 172 Richard Biener 2013-01-17 10:53:29 UTC
(In reply to comment #171)
> Created attachment 29182 [details]
> Patch to compress line info
> 
> This patch removes column information from LTO (so we lose caret diagnostics
> in warning/error output at LTO time, which seems a reasonable thing to do) and
> avoids entering duplicate locators into the linemap.  The patch reduces linemap
> usage from 23% to 5% of GGC memory, saving 1-2GB on Mozilla (and also reducing LTO
> file size).

Patch looks incomplete?  What does dropping columns only do to memory use?
Please disable flag_diagnostics_show_caret unconditionally in lto1 if you
do that.
Comment 173 Jan Hubicka 2013-01-17 12:30:30 UTC
> Patch looks incomplete?  What does dropping columns only do to memory use?

I will check.  I remember that prior columns there was also some savings for the cache.
Just saving 20% out of 23% is cooler than saving 20% out of 5% of memory.
Note that we are still over 8GB for Mozilla LTO after latest Mozilla checkout.  

> Please disable flag_diagnostics_show_caret unconditionally in lto1 if you
> do that.

Yeah, I wanted to, but I am not sure where in lto.c the proper place to do so is.
Comment 174 Jakub Jelinek 2013-01-17 12:42:06 UTC
lto_post_options ?
Comment 175 Jan Hubicka 2013-01-17 14:40:04 UTC
Created attachment 29191 [details]
alternative patch without the compression.

This is an alternative patch that just skips columns but does not do the compression.
It seems that compression is actually quite effective.
Non-compressing w/o column info is 1073872920 bytes,
compression + no column is 268566544 bytes,
compression + column is 1073872920 bytes.

Perhaps I messed up the caching with column info?  It strikes me as wrong that the numbers are precisely the same. But perhaps it is just the reallocation strategy. I will also generate fresh numbers for unpatched GCC.
Comment 176 Richard Biener 2013-01-17 14:54:22 UTC
(In reply to comment #175)
> Created attachment 29191 [details]
> alternative patch without the compression.
> 
> This is alternative patch just skipping columns but not doing the compression.
> It seems that compression is actually quite effective.
> Non-compressing w/o column info is 1073872920 bytes,
> compression + no column is 268566544 bytes
> compression + column is 1073872920 bytes
> 
> Perhaps I messed up the caching with column info?  It strikes wrong that the
> numbers are precisely the same. But perhaps it is just reallocation strategy. I
> will also generate fresh numbers for unpatched GCC.

+    linemap_line_start (line_table, data_in->current_line, 0);

-  return linemap_position_for_column (line_table, data_in->current_col);
+  return linemap_position_for_column (line_table, 0);

linemap_line_start will already return a location for column 0.

So I'd say we want

  if (file_change)
    {
      ...
    }

  return linemap_line_start (line_table, data_in->current_line, 0);

instead.  Which hopefully does nothing if nothing changed.

I don't know how you implement caching - you didn't attach a patch to do so.
Comment 177 Jan Hubicka 2013-01-17 15:13:53 UTC
Created attachment 29192 [details]
caching

Aha, now I see why you ask for the complete patch. I obviously messed up the code.  This is how I do caching (in a version that still has columns in it). I removed the final incarnation of the patch, but it should be easy to redo.
Comment 178 Jan Hubicka 2013-01-17 17:11:13 UTC
The global cache with arbitrarily large size reduces usage down to 0.3% (16908304 bytes). So it seems that sharing across files is quite an important part of the game.  I will try to fiddle with the cache size to see how big a cache is actually needed.

Unpatched mainline needs 1073872920 bytes, which is the same as with dropping columns and/or my initial local caching implementation.  This is apparently because of the exponential resizing of the table (i.e. we simply do not save enough to see a difference).

Honza
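
The "exponential resizing" remark can be made concrete with a small sketch. Assume, purely for illustration, a table that starts at some initial capacity and doubles whenever it is full; the initial size and the growth factor here are assumptions, not values taken from the GCC sources. Any entry count that lands in the same power-of-two bucket yields the same allocation, so a modest reduction in entries does not change the reported memory at all.

```cpp
#include <cstddef>

// Capacity of a table that starts with `initial` slots and doubles each time
// it runs out of room, once it is large enough to hold `entries` items.
// Both the starting size and the doubling policy are illustrative assumptions.
std::size_t capacity_after(std::size_t entries, std::size_t initial = 16) {
  std::size_t capacity = initial;
  while (capacity < entries)
    capacity *= 2;
  return capacity;
}
```

With these assumptions, 1000000 and 600000 entries both end up with 1048576 slots, while 500000 entries fit in 524288; only a saving that crosses a doubling boundary shows up in the statistics.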
Comment 179 Martin Jambor 2013-03-06 15:14:35 UTC
I'm currently (gcc revision 196427, FF changeset 123831:c95439870e05)
facing a few ICEs during the compilation phase with the following
backtrace:

#0  0x0000000000f89a73 in get_location_from_adhoc_loc (set=0x7ffff7ff2000,
    loc=2947526575) at /home/mjambor/gcc/trunk/src/libcpp/line-map.c:165
#1  0x0000000000c247fe in inlined_function_outer_scope_p (block=0x7fffee4bcb28)
    at /home/mjambor/gcc/trunk/src/gcc/tree.h:5561
#2  pack_ts_block_value_fields (expr=0x7fffee4bcb28, bp=0x7fffffffd1a0, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:319
#3  streamer_pack_tree_bitfields (ob=0x1c73210, bp=0x7fffffffd1a0, expr=0x7fffee4bcb28)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:417
#4  0x00000000009c3bc9 in lto_write_tree (ref_p=true, expr=0x7fffee4bcb28, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:317
#5  lto_output_tree (ob=0x1c73210, expr=0x7fffee4bcb28, ref_p=true,
    this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410
#6  0x0000000000c26617 in write_ts_common_tree_pointers (ref_p=true,
    expr=0x7ffff3f6bc80, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:514
#7  streamer_write_tree_body (ob=0x1c73210, expr=0x7ffff3f6bc80, ref_p=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:845
#8  0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7ffff3f6bc80, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321
#9  lto_output_tree (ob=ob@entry=0x1c73210, expr=0x7ffff3f6bc80, ref_p=ref_p@entry=true,
    this_ref_p=this_ref_p@entry=true)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410
#10 0x0000000000c26e62 in write_ts_exp_tree_pointers (ref_p=<optimized out>,
    expr=<optimized out>, ob=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:747
#11 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc63dc0, ref_p=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:884
#12 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7fffecc63dc0, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321
#13 lto_output_tree (ob=0x1c73210, expr=0x7fffecc63dc0, ref_p=true,
    this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410
#14 0x0000000000c26df8 in write_ts_exp_tree_pointers (ref_p=<optimized out>,
    expr=<optimized out>, ob=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:746
#15 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc70078, ref_p=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:884
#16 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7fffecc70078, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321
#17 lto_output_tree (ob=ob@entry=0x1c73210, expr=0x7fffecc70078, ref_p=ref_p@entry=true,
    this_ref_p=this_ref_p@entry=true)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410
#18 0x0000000000c2681d in write_ts_decl_common_tree_pointers (ref_p=true,
    expr=0x7fffecc6d720, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:584
#19 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc6d720, ref_p=<optimized out>)
    at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:857
#20 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7fffecc6d720, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321
#21 lto_output_tree (ob=0x1c73210, expr=0x7fffecc6d720, ref_p=true,
    this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410
#22 0x0000000000ecd118 in output_gimple_stmt (stmt=0x7fffec6206c0, ob=0x1c73210)
    at /home/mjambor/gcc/trunk/src/gcc/gimple-streamer-out.c:143
#23 output_bb (ob=0x1c73210, bb=0x7fffed130f08, fn=0x7fffef8603f0)
    at /home/mjambor/gcc/trunk/src/gcc/gimple-streamer-out.c:199
#24 0x00000000009c4f26 in output_function (node=0x7fffef8614a0)
    at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:823
#25 lto_output () at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:987
#26 0x00000000009fa971 in ipa_write_summaries_2 (
    pass=0x1618f00 <pass_ipa_lto_gimple_out>, state=0x1ad8c00)
    at /home/mjambor/gcc/trunk/src/gcc/passes.c:2408

The statement being written is:
(gdb) call debug_gimple_stmt ((gimple)0x7fffec6206c0)
# DEBUG v => 18444633011384221696

This happens for example during compilation of
js/src/ion/shared/CodeGenerator-shared.cpp
Comment 180 Richard Biener 2013-03-07 16:08:29 UTC
Try

Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c   (revision 196520)
+++ gcc/tree-inline.c   (working copy)
@@ -3929,7 +3929,7 @@ expand_call_inline (basic_block bb, gimp
     {
       id->block = make_node (BLOCK);
       BLOCK_ABSTRACT_ORIGIN (id->block) = fn;
-      BLOCK_SOURCE_LOCATION (id->block) = input_location;
+      BLOCK_SOURCE_LOCATION (id->block) = LOCATION_LOCUS (input_location);
       prepend_lexical_block (gimple_block (stmt), id->block);
     }
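
For readers unfamiliar with the location encoding, a simplified model of what the patch changes. This is a sketch under stated assumptions, not GCC's implementation: it assumes that a location with the top bit set is a handle into a side table of (locus, block) pairs, and that stripping a location means replacing such a handle by the plain locus it refers to. All names below are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Simplified, hypothetical model of ad-hoc source locations.
using location_t = std::uint32_t;
constexpr location_t kPlainMask = 0x7FFFFFFFu;  // assumed: low 31 bits
constexpr location_t kAdhocBit  = 0x80000000u;  // assumed: top bit marks a handle

struct AdhocEntry {
    location_t locus;   // plain location (file/line/column)
    const void* block;  // lexical block attached to that location
};

struct LocationTable {
    std::vector<AdhocEntry> adhoc;

    // Attach a block to a plain location and return a handle for the pair.
    location_t make_adhoc(location_t locus, const void* block) {
        adhoc.push_back({locus, block});
        return kAdhocBit | static_cast<location_t>(adhoc.size() - 1);
    }

    static bool is_adhoc(location_t loc) { return (loc & kAdhocBit) != 0; }

    // Strip a location down to its plain locus. A handle that does not
    // index a live entry makes at() throw std::out_of_range.
    location_t locus(location_t loc) const {
        return is_adhoc(loc) ? adhoc.at(loc & kPlainMask).locus : loc;
    }
};
```

In this model, storing `locus(input_location)` instead of the raw value means the stored field never needs a table lookup later. Note that the `loc=2947526575` in the backtrace of comment #179 is 0xAFAFAFAF, which has the top bit set and would therefore be treated as a handle here.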
Comment 181 Martin Jambor 2013-03-08 10:41:54 UTC
The bug described in comment #179 is now PR 56570.
Comment 182 Jan Hubicka 2013-06-17 16:33:24 UTC
OK, after a while I should update the stats here.  Richard's new tree merging patch makes libxul linking a lot faster and less memory consuming.
Peak memory usage (in TOP) is now just below 10GB; with a bit of incremental improvement I hope to get below 8GB again soon.

Build time is
real    19m0.355s
user    56m20.459s
sys     2m17.533s

GGC memory usage after stream in 4938399k

Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  72.86 (12%) usr   0.90 ( 3%) sys  75.25 (11%) wall  270952 kB ( 7%) ggc
 phase stream in         : 274.88 (44%) usr   9.01 (26%) sys 294.99 (43%) wall 3478515 kB (93%) ggc
 phase stream out        : 282.18 (45%) usr  24.40 (71%) sys 308.42 (45%) wall    7162 kB ( 0%) ggc
 garbage collection      :  12.99 ( 2%) usr   0.01 ( 0%) sys  13.00 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   1.95 ( 0%) usr   0.00 ( 0%) sys   1.95 ( 0%) wall      32 kB ( 0%) ggc
 ipa cp                  :   9.82 ( 2%) usr   0.39 ( 1%) sys  10.26 ( 2%) wall  418482 kB (11%) ggc
 ipa inlining heuristics :  39.30 ( 6%) usr   1.12 ( 3%) sys  41.52 ( 6%) wall 1353294 kB (36%) ggc
 ipa lto gimple in       :   0.45 ( 0%) usr   0.15 ( 0%) sys   0.62 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  18.24 ( 3%) usr   1.50 ( 4%) sys  19.86 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 200.68 (32%) usr   5.85 (17%) sys 216.44 (32%) wall 3887175 kB (103%) ggc
 ipa lto decl out        : 256.24 (41%) usr  13.44 (39%) sys 271.24 (40%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   7.20 ( 1%) usr   1.61 ( 5%) sys   8.83 ( 1%) wall 2134157 kB (57%) ggc
 ipa lto decl merge      :  27.71 ( 4%) usr   0.01 ( 0%) sys  27.72 ( 4%) wall    8270 kB ( 0%) ggc
 ipa lto cgraph merge    :  17.31 ( 3%) usr   0.07 ( 0%) sys  17.39 ( 3%) wall  142240 kB ( 4%) ggc
 whopr wpa               :   8.82 ( 1%) usr   0.04 ( 0%) sys   8.89 ( 1%) wall    7165 kB ( 0%) ggc
 whopr wpa I/O           :   1.63 ( 0%) usr   9.43 (27%) sys  11.19 ( 2%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   3.21 ( 1%) usr   0.04 ( 0%) sys   3.25 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.56 ( 1%) usr   0.04 ( 0%) sys   5.81 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   1.83 ( 0%) usr   0.02 ( 0%) sys   1.86 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   6.07 ( 1%) usr   0.18 ( 1%) sys   6.26 ( 1%) wall       0 kB ( 0%) ggc
 inline parameters       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      14 kB ( 0%) ggc
 tree copy propagation   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      27 kB ( 0%) ggc
 tree SSA other          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.14 ( 0%) usr   0.12 ( 0%) sys   0.24 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :  10.69 ( 2%) usr   0.29 ( 1%) sys  11.10 ( 2%) wall       0 kB ( 0%) ggc 
 TOTAL                 : 629.93            34.31           678.67            3758029 kB

Memory usage seems about the same with -g.
Honza
Comment 183 Jan Hubicka 2013-06-17 17:28:00 UTC
type merging stats
[WPA] read 43156894 SCCs of average size 2.270660
[WPA] 97994652 tree bodies read in total
[WPA] tree SCC table: size 8388593, 3830511 elements, collision ratio: 0.684487
[WPA] tree SCC max chain length 88 (size 1)
[WPA] Compared 19139975 SCCs, 344923 collisions (0.018021)
[WPA] Merged 19067050 SCCs
[WPA] Merged 58757829 tree bodies
[WPA] Merged 11951381 types
[WPA] 4357267 types prevailed (13278034 associated trees)
[WPA] Old merging code merges an additional 2026163 types of which 140937 are in the same SCC with their prevailing variant (12389865 and 6362266 associated trees)
[WPA] GIMPLE canonical type table: size 131071, 77910 elements, 4357402 searches, 1095104 collisions (ratio: 0.251320)
[WPA] GIMPLE canonical type hash table: size 8388593, 4357346 elements, 15252531 searches, 11817317 collisions (ratio: 0.774777)
[WPA] # of input files: 4918
[WPA] # of input cgraph nodes: 0
[WPA] # of function bodies: 0
[WPA] # of output files: 0
[WPA] # of output symtab nodes: 0
[WPA] # of output tree pickle references: 0
[WPA] # of output tree bodies: 0
[WPA] # callgraph partitions: 0
[WPA] Compression: 1311851796 input bytes, 4153897270 uncompressed bytes (ratio: 3.166438)
[WPA] Size of mmap'd section decls: 1311851796 bytes
[LTRANS] read 314277 SCCs of average size 6.082532
[LTRANS] 1911600 tree bodies read in total
[LTRANS] GIMPLE canonical type table: size 16381, 9653 elements, 453967 searches, 24697 collisions (ratio: 0.054403)
[LTRANS] GIMPLE canonical type hash table: size 1048573, 453913 elements, 1562009 searches, 1517260 collisions (ratio: 0.971352)
[LTRANS] # of input files: 1
[LTRANS] # of input cgraph nodes: 0
[LTRANS] # of function bodies: 0
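
The ratios in the listing above can be reproduced from the counts printed on the same line. The sketch below assumes that a collision ratio is collisions divided by searches and that the compression ratio is uncompressed bytes divided by input bytes; both assumptions match the printed numbers. The function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: ratio of two counters, 0.0 when the denominator is 0.
double ratio(std::uint64_t numerator, std::uint64_t denominator) {
    if (denominator == 0) {
        return 0.0;
    }
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}
```

For example, `ratio(1095104, 4357402)` is about 0.251320 (WPA canonical type table), `ratio(11817317, 15252531)` is about 0.774777 (WPA canonical type hash table), and `ratio(4153897270, 1311851796)` is about 3.166438 (compression).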
Comment 184 Jan Hubicka 2013-06-19 15:38:41 UTC
New profiles after Richard's changes to remove pointer maps from streaming in.

Stream in:
samples  %        app name                 symbol name
36599    12.3464  lto1                     inflate_fast
27382     9.2371  lto1                     streamer_read_uhwi(lto_input_block*)
19282     6.5047  lto1                     streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
15807     5.3324  lto1                     compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
11385     3.8407  libc-2.11.1.so           msort_with_tmp
9054      3.0543  libc-2.11.1.so           memcpy
8701      2.9352  lto1                     htab_find_slot_with_hash
8506      2.8694  lto1                     lto_input_tree(lto_input_block*, data_in*)
8405      2.8354  lto1                     lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
8055      2.7173  lto1                     ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
6436      2.1711  lto1                     streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
6287      2.1209  lto1                     adler32
5891      1.9873  lto1                     streamer_get_pickled_tree(lto_input_block*, data_in*)


Stream out:
samples  %        app name                 symbol name
19885    14.6837  lto1                     DFS_write_tree(output_block*, sccs*, tree_node*, bool, bool)
19285    14.2407  lto1                     linemap_lookup(line_maps*, unsigned int)
16192    11.9567  lto1                     streamer_write_uhwi_stream(lto_output_stream*, unsigned long)
15926    11.7603  lto1                     pointer_map_insert(pointer_map_t*, void const*)
10285     7.5948  lto1                     pointer_map_contains(pointer_map_t const*, void const*)
7324      5.4083  lto1                     streamer_tree_cache_lookup(streamer_tree_cache_d*, tree_node*, unsigned int*)
5897      4.3545  lto1                     streamer_pack_tree_bitfields(output_block*, bitpack_d*, tree_node*)
5374      3.9683  lto1                     lto_output_tree(output_block*, tree_node*, bool, bool)
4896      3.6154  lto1                     streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int, unsigned int*, bool)
3285      2.4258  libc-2.11.1.so           memset
2669      1.9709  lto1                     streamer_write_tree_body(output_block*, tree_node*, bool)
2520      1.8608  libc-2.11.1.so           memcpy
2383      1.7597  lto1                     streamer_tree_cache_add_to_node_array(streamer_tree_cache_d*, unsigned int, tree_node*, unsigned int)

linemap_lookup is an easy target, obviously.
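
One way such a lookup could be made cheaper is a one-entry cache in front of the binary search, since consecutive lookups during stream out tend to hit the same map. The sketch below is a hypothetical stand-in, not libcpp's line table and not the patch attached to this bug: maps are modelled only by their sorted start locations, and the class and member names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical line table: map i covers [starts[i], starts[i + 1]).
class CachedLineTable {
public:
    explicit CachedLineTable(std::vector<std::uint32_t> starts)
        : starts_(std::move(starts)) {}

    // Index of the last map whose start is <= loc (0 if there is none).
    std::size_t lookup(std::uint32_t loc) {
        if (starts_.empty()) {
            return 0;
        }
        if (cache_valid_ && contains(cached_, loc)) {
            ++hits_;  // answered without a binary search
            return cached_;
        }
        auto it = std::upper_bound(starts_.begin(), starts_.end(), loc);
        std::size_t idx = (it == starts_.begin())
            ? 0
            : static_cast<std::size_t>(it - starts_.begin()) - 1;
        cached_ = idx;
        cache_valid_ = true;
        return idx;
    }

    std::size_t hits() const { return hits_; }

private:
    bool contains(std::size_t i, std::uint32_t loc) const {
        bool after_start = starts_[i] <= loc;
        bool before_next = (i + 1 == starts_.size()) || loc < starts_[i + 1];
        return after_start && before_next;
    }

    std::vector<std::uint32_t> starts_;
    std::size_t cached_ = 0;
    bool cache_valid_ = false;
    std::size_t hits_ = 0;
};
```

Whether this pays off depends on how clustered the looked-up locations are; the hit counter is there only to make that measurable.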

Execution times (seconds)
 phase setup             :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  69.29 (14%) usr   0.82 ( 3%) sys  70.62 (13%) wall  270269 kB (11%) ggc
 phase stream in         : 224.95 (44%) usr   6.23 (22%) sys 236.02 (43%) wall 2174294 kB (89%) ggc
 phase stream out        : 213.26 (42%) usr  21.35 (75%) sys 236.87 (44%) wall    7157 kB ( 0%) ggc
 garbage collection      :   9.92 ( 2%) usr   0.00 ( 0%) sys   9.99 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   1.36 ( 0%) usr   0.00 ( 0%) sys   1.34 ( 0%) wall      32 kB ( 0%) ggc
 ipa cp                  :   7.65 ( 2%) usr   0.32 ( 1%) sys   8.01 ( 1%) wall  418436 kB (17%) ggc
 ipa inlining heuristics :  38.83 ( 8%) usr   0.83 ( 3%) sys  39.99 ( 7%) wall 1352530 kB (55%) ggc
 ipa lto gimple in       :   0.39 ( 0%) usr   0.05 ( 0%) sys   0.53 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  16.46 ( 3%) usr   1.39 ( 5%) sys  17.93 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 158.55 (31%) usr   3.99 (14%) sys 166.99 (31%) wall 2583106 kB (105%) ggc
 ipa lto decl out        : 191.10 (38%) usr  11.48 (40%) sys 203.47 (37%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   7.07 ( 1%) usr   1.17 ( 4%) sys   8.27 ( 2%) wall 2134131 kB (87%) ggc
 ipa lto decl merge      :  29.94 ( 6%) usr   0.01 ( 0%) sys  30.06 ( 6%) wall    8270 kB ( 0%) ggc
 ipa lto cgraph merge    :  12.02 ( 2%) usr   0.04 ( 0%) sys  12.13 ( 2%) wall  142240 kB ( 6%) ggc
 whopr wpa               :   7.30 ( 1%) usr   0.03 ( 0%) sys   7.39 ( 1%) wall    7160 kB ( 0%) ggc
 whopr wpa I/O           :   1.40 ( 0%) usr   8.46 (30%) sys  11.14 ( 2%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   2.33 ( 0%) usr   0.01 ( 0%) sys   2.36 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.44 ( 1%) usr   0.04 ( 0%) sys   5.53 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   1.26 ( 0%) usr   0.04 ( 0%) sys   1.32 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   5.87 ( 1%) usr   0.13 ( 0%) sys   6.03 ( 1%) wall       0 kB ( 0%) ggc
 inline parameters       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      14 kB ( 0%) ggc
 tree eh                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      27 kB ( 0%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree FRE                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.10 ( 0%) usr   0.18 ( 1%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :  10.42 ( 2%) usr   0.23 ( 1%) sys  10.76 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                 : 507.52            28.40           543.51            2453120 kB
Comment 185 Jan Hubicka 2013-08-02 14:18:50 UTC
I merged in some patches intended to reduce the memory use of Firefox LTO and also updated the Firefox tree. Some more involved patches are on the way, so this is a summary of where we stand now.

WPA usage in TOP is 10GB now.

1) After streaming in trees, the GGC usage is now 5.1GB
   - 2.5GB are trees,
   - 1GB are linemaps
   - 0.8GB are decl maps (decl states)

tree_list            12561507
integer_type         1511296
pointer_type         4610735
record_type          8139077
method_type          2401664
integer_cst          6677946
string_cst           2127890
function_decl        6069299
label_decl            504859
field_decl           5104957
var_decl              596020
const_decl           5401253
parm_decl            9002744
type_decl            10150100
result_decl          2181250
addr_expr            4173661
tree_binfo           4780477


 I have a cache that cuts down the linemaps, plus a patch to not stream PARM_DECLs and RETURN_DECLs.  With this the usage goes below 3GB.

2) Cgraph streaming now becomes an important factor.
   GGC usage goes up to 7.7GB
   GGC use:
     - cgraph nodes themselves are 1.5GB
     - inline summaries are 0.5GB
     - cgraph edges are 3.7GB
     - IPA references 2.3GB
     - IPA-prop 0.7GB
   Off GGC
     - IPA-prop 0.6GB
     - Inline summary 0.5GB
     - symtab encoder 0.17GB

   Here one can easily
     - compress the vectors recording definitions
     - pull off parts of cgraph nodes that are not really needed by WPA (nested info, etc.)
     - perhaps implement streaming of the merged cgraph.

So the good news is that we now have a lot of interesting low-hanging fruit. The bad news is that tree streaming still feels slow.  I suppose we need to dig more into what trees really need to go into WPA...
Comment 186 Jan Hubicka 2013-08-02 16:32:45 UTC
oprofile of merging
67647    13.0501  lto1                     inflate_fast
38682     7.4624  lto1                     compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
32365     6.2437  lto1                     streamer_read_uhwi(lto_input_block*)
31198     6.0186  lto1                     streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
21155     4.0811  libc-2.11.1.so           msort_with_tmp
19581     3.7775  lto1                     ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
16584     3.1993  lto1                     lto_input_tree(lto_input_block*, data_in*)
15203     2.9329  lto1                     lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
15194     2.9312  libc-2.11.1.so           memcpy
14823     2.8596  lto1                     htab_find_slot_with_hash
12860     2.4809  lto1                     streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
12705     2.4510  lto1                     hash_table<tree_scc_hasher, xcallocator>::find_slot_with_hash(tree_scc const*, unsigned int, insert_option)
11773     2.2712  lto1                     adler32
11504     2.2193  libc-2.11.1.so           _IO_vfscanf
11401     2.1994  lto1                     unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
9548      1.8420  lto1                     streamer_get_pickled_tree(lto_input_block*, data_in*)
9315      1.7970  lto1                     inflate

IPA
18799     6.2862  lto1                     symtab_remove_unreachable_nodes(bool, _IO_FILE*)
11878     3.9719  lto1                     cgraph_redirect_edge_callee(cgraph_edge*, cgraph_node*)
11223     3.7528  lto1                     do_per_function(void (*)(void*), void*)
10813     3.6157  lto1                     pointer_set_lookup(pointer_set_t const*, void const*, unsigned long*)
8415      2.8139  lto1                     ipa_reverse_postorder(cgraph_node**)
7689      2.5711  lto1                     htab_find_slot_with_hash
7677      2.5671  lto1                     do_estimate_growth_1(cgraph_node*, void*)
7477      2.5002  libc-2.11.1.so           free
7035      2.3524  libc-2.11.1.so           malloc_consolidate

Stream out
9440     16.1663  lto1                     linemap_lookup(line_maps*, unsigned int)
7663     13.1231  lto1                     DFS_write_tree(output_block*, sccs*, tree_node*, bool, bool)
6052     10.3643  lto1                     streamer_write_uhwi_stream(lto_output_stream*, unsigned long)
5831      9.9858  lto1                     pointer_set_lookup(pointer_set_t const*, void const*, unsigned long*)
3342      5.7233  lto1                     streamer_tree_cache_lookup(streamer_tree_cache_d*, tree_node*, unsigned int*)
2229      3.8172  lto1                     pointer_map_insert(pointer_map_t*, void const*)
2196      3.7607  lto1                     streamer_pack_tree_bitfields(output_block*, bitpack_d*, tree_node*)
2054      3.5175  lto1                     lto_output_tree(output_block*, tree_node*, bool, bool)
1656      2.8360  lto1                     inflate_fast
1655      2.8342  lto1                     pointer_map<unsigned int>::insert(void const*, bool*)
Comment 187 Jan Hubicka 2013-08-03 08:45:00 UTC
WPA time report
Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1398 kB ( 0%) ggc
 phase opt and generate  :  80.79 (13%) usr   1.01 ( 3%) sys  81.96 (12%) wall  315727 kB (25%) ggc
 phase stream in         : 283.33 (45%) usr   7.82 (24%) sys 292.12 (44%) wall  940315 kB (74%) ggc
 phase stream out        : 261.66 (42%) usr  23.14 (72%) sys 287.88 (43%) wall    7534 kB ( 1%) ggc
 garbage collection      :  14.45 ( 2%) usr   0.02 ( 0%) sys  14.48 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   2.55 ( 0%) usr   0.00 ( 0%) sys   2.55 ( 0%) wall      33 kB ( 0%) ggc
 ipa cp                  :  10.45 ( 2%) usr   0.36 ( 1%) sys  10.81 ( 2%) wall  456287 kB (36%) ggc
 ipa inlining heuristics :  42.12 ( 7%) usr   1.06 ( 3%) sys  43.27 ( 7%) wall 1485346 kB (117%) ggc
 ipa lto gimple in       :   0.56 ( 0%) usr   0.25 ( 1%) sys   0.87 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  21.77 ( 3%) usr   1.72 ( 5%) sys  23.53 ( 4%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 183.90 (29%) usr   4.77 (15%) sys 189.46 (29%) wall  959299 kB (76%) ggc
 ipa lto decl out        : 231.70 (37%) usr  10.78 (34%) sys 242.73 (37%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :  14.38 ( 2%) usr   1.57 ( 5%) sys  15.99 ( 2%) wall 2405760 kB (190%) ggc
 ipa lto decl merge      :  32.16 ( 5%) usr   0.00 ( 0%) sys  32.24 ( 5%) wall    8268 kB ( 1%) ggc
 ipa lto cgraph merge    :  28.72 ( 5%) usr   0.06 ( 0%) sys  28.81 ( 4%) wall  135235 kB (11%) ggc
 whopr wpa               :   9.57 ( 2%) usr   0.05 ( 0%) sys   9.62 ( 1%) wall    7537 kB ( 1%) ggc
 whopr wpa I/O           :   2.07 ( 0%) usr  10.62 (33%) sys  15.49 ( 2%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   3.26 ( 1%) usr   0.03 ( 0%) sys   3.29 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.55 ( 1%) usr   0.05 ( 0%) sys   5.62 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   2.82 ( 0%) usr   0.05 ( 0%) sys   2.88 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   6.25 ( 1%) usr   0.13 ( 0%) sys   6.38 ( 1%) wall       0 kB ( 0%) ggc
 unaccounted todo        :  13.25 ( 2%) usr   0.28 ( 1%) sys  13.58 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                 : 625.79            31.97           661.97            1264976 kB
Comment 188 Jan Hubicka 2013-08-14 15:59:23 UTC
With the patch to remove unreachable virtual methods early (http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00774.html), the memory usage for Firefox WPA goes down to 3.4GB (from 10GB). Most of the time is still spent on streaming:

 phase opt and generate  :  48.52 (15%) usr   0.54 ( 3%) sys  49.20 (14%) wall  391219 kB ( 6%) ggc
 phase stream in         :  87.84 (26%) usr   2.03 (10%) sys  90.15 (25%) wall 5968649 kB (94%) ggc
 phase stream out        : 197.98 (59%) usr  18.61 (88%) sys 217.58 (61%) wall    7585 kB ( 0%) ggc
 garbage collection      :   3.10 ( 1%) usr   0.00 ( 0%) sys   3.11 ( 1%) wall       0 kB ( 0%) ggc
 ipa unreachable code removal:   5.25 ( 2%) usr   0.12 ( 1%) sys   5.43 ( 2%) wall       0 kB ( 0%) ggc
 ipa inheritance graph construction:   0.26 ( 0%) usr   0.00 ( 0%) sys   0.26 ( 0%) wall    1059 kB ( 0%) ggc
 ipa virtual call target lookup:  13.76 ( 4%) usr   0.08 ( 0%) sys  13.80 ( 4%) wall   98807 kB ( 2%) ggc
 ipa cp                  :   2.79 ( 1%) usr   0.14 ( 1%) sys   2.95 ( 1%) wall  188635 kB ( 3%) ggc
 ipa inlining heuristics :  18.85 ( 6%) usr   0.24 ( 1%) sys  19.16 ( 5%) wall  439913 kB ( 7%) ggc
 ipa lto gimple out      :  18.80 ( 6%) usr   1.52 ( 7%) sys  20.39 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  73.72 (22%) usr   1.51 ( 7%) sys  75.49 (21%) wall 5180378 kB (81%) ggc
 ipa lto decl out        : 173.97 (52%) usr   7.61 (36%) sys 181.91 (51%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.73 ( 1%) usr   0.18 ( 1%) sys   1.91 ( 1%) wall  428921 kB ( 7%) ggc
 TOTAL                 : 334.36            21.18           356.94            6368853 kB

Streaming in is rather slow because about 80% of trees streamed are duplicates.

WPA still streams out 4GB of object files, which seems to be the main bottleneck.  I have some experiments here.

Most common tree nodes:
tree_list            5707422
integer_type         1064175
pointer_type         2195993
record_type          4539776
integer_cst          4399813
function_decl        1127978
field_decl           3475888
const_decl           3462163
type_decl            5970713
addr_expr            1275696
tree_binfo           2903028

GGC memory is 1.6GB after tree streaming, 2.1GB after IPA streaming.

Vecs:
ipa-devirt.c:406 (get_odr_type)                      172200: 0.2%     336624            3465: 0.0%
ipa-devirt.c:407 (get_odr_type)                      330376: 0.3%     655112            8267: 0.0%
ipa-devirt.c:835 (ipa_devirt_init)                  1386952: 1.5%    2419240           15701: 0.1%
ipa-devirt.c:524 (maybe_record_node)                1678248: 1.8%    3094376           21842: 0.1%
ipa-reference.c:168 (set_reference_optimization_    5457952: 5.8%    8672960              11: 0.0%
vec.h:1460 (copy)                                   6814272: 7.2%   34335740          604256: 2.4%
ipa-inline-analysis.c:3754 (read_inline_edge_sum    7254040: 7.7%   16934500          849179: 3.4%
ipa-ref.c:54 (ipa_record_reference)                11668584:12.3%   34881384          494857: 2.0%
passes.c:2208 (execute_one_pass)                   24435584:25.8%   41942968          651148: 2.6%
ipa-inline-analysis.c:944 (inline_summary_alloc)   35603464:37.6%   58351856          200862: 0.8%
Total                                              94804952                          24781481

GGC:
cgraph.c:912 (cgraph_allocate_init_indirect_info          0: 0.0%    1487184: 0.0%    7575408: 0.3%          0: 0.0%     188804
tree.c:1263 (build_int_cst_wide)                     235456: 0.1%          0: 0.0%    8371392: 0.4%          0: 0.0%     268964
ipa-prop.c:2836 (ipa_set_node_agg_value_chain)            0: 0.0%          0: 0.0%    8388608: 0.4%          0: 0.0%          1
ipa-inline-analysis.c:716 (account_size_time)             0: 0.0%    2140820: 0.1%    9143868: 0.4%     240712: 0.3%      28736
ipa-inline-analysis.c:3820 (inline_read_section)          0: 0.0%   12942208: 0.3%   17905336: 0.8%    1287480: 1.4%     228397
ggc-common.c:244 (ggc_cleared_alloc_ptr_array_tw      61536: 0.0%  211278568: 5.0%   26406128: 1.2%     190280: 0.2%       9549
stringpool.c:74 (alloc_node)                              0: 0.0%          0: 0.0%   28859960: 1.3%          0: 0.0%     721499
ipa-ref.c:50 (ipa_record_reference)                       0: 0.0%   96510048: 2.3%   36203704: 1.6%    1329136: 1.4%     577500
lto-section-in.c:363 (lto_new_in_decl_state)         343800: 0.1%          0: 0.0%   38477520: 1.7%          0: 0.0%     323511
stringpool.c:57 (stringpool_ggc_alloc)                    0: 0.0%          0: 0.0%   44558843: 2.0%    2783411: 3.0%     721499
tree-streamer-in.c:482 (unpack_value_fields)       15732776: 5.4%          0: 0.0%   45589448: 2.1%     292720: 0.3%     157392
tree-streamer-in.c:562 (streamer_alloc_tree)         300256: 0.1%  241332496: 5.7%   48823360: 2.2%      13216: 0.0%    2903028
lto/lto.c:2711 (create_subid_section_table)         1939520: 0.7%          0: 0.0%   49182144: 2.2%   10096128:10.9%       5008
ipa-inline-analysis.c:3832 (inline_read_section)          0: 0.0%   51338084: 1.2%   55137448: 2.5%    1313100: 1.4%     416983
ipa-inline-analysis.c:942 (inline_summary_alloc)          0: 0.0%          0: 0.0%   67108920: 3.0%         56: 0.0%          1
toplev.c:960 (realloc_for_line_map)                       0: 0.0%   22493304: 0.5%   67239960: 3.0%        144: 0.0%         14
vec.h:792 (vec_safe_copy)                           1227184: 0.4%  117348220: 2.8%   94975608: 4.3%    5710316: 6.2%     933481
cgraph.c:840 (cgraph_create_edge_1)                       0: 0.0%          0: 0.0%  115711128: 5.2%          0: 0.0%    1112607
lto/lto.c:240 (lto_read_in_decl_state)              1103456: 0.4%          0: 0.0%  162786024: 7.4%   30006496:32.4%    2264577
cgraph.c:499 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  207403696: 9.4%          0: 0.0%     682249
tree-streamer-in.c:573 (streamer_alloc_tree)       75348944:25.9% 3354555968:78.9% 1038361448:47.0%   36020064:38.9%   35699328
Total                                             290565511       4250101446       2211224488         92491589         56817662
source location                                     Garbage            Freed             Leak         Overhead            Times
Comment 189 Martin Liška 2013-08-21 09:02:44 UTC
I've encountered problems connected with PGO:

gcc revision: 201894
firefox changeset:  143205:1d6bf2bd4003 (Aug 20, 2013)

I build an instrumented binary without LTO and after that I use the profile for LTO:
MYFLAGS="-flto=9 -fno-fat-lto-objects -ftoplevel-reorder -fprofile-use -Wno-error=coverage-mismatch"

I know that there are gcda files that are mentioned in this thread and were removed by me:

jemalloc.gcda (makes sense)
ptsynch.gcda (likewise)

HashFunctions.gcda (?)
sqlite3.gcda (?)

After linking of sqlite3, there are many corrupted profiles like:
/ssd/firefox/js/src/gc/Marking.cpp
/ssd/firefox/js/src/frontend/BytecodeEmitter.cpp
/ssd/firefox/js/src/frontend/Interpreter.cpp
...

Example of an error:
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘js::gc::IsAboutToBeFinalized<JSAtom>(JSAtom**)bool [clone .isra.65]’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
 }
 ^
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-6 thought to be -81
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 39667
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘js::gc::IsAboutToBeFinalized<js::UnownedBaseShape>(js::UnownedBaseShape**)bool [clone .isra.52]’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-6 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 41156
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘MarkInternal<JSAtom>(JSTracer*, JSAtom**)void’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 9-14 thought to be -39
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 9-10 thought to be 180119
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘MarkInternal<JSObject>(JSTracer*, JSObject**)void’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 11-18 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 11-12 thought to be 49007
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::MarkStack<unsigned long>::push(unsigned long)’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 4-6 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 4-5 thought to be 1
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::GCMarker::drainMarkStack(js::SliceBudget&)’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be -7
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-1 thought to be 7
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::ObjectImpl::slotSpan() const’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 5-7 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 5-6 thought to be 15965

Thank you,
Martin
Comment 190 Jan Hubicka 2013-08-21 13:01:18 UTC
> /ssd/firefox/js/src/gc/Marking.cpp: In function
> ‘js::gc::IsAboutToBeFinalized<JSAtom>(JSAtom**)bool [clone .isra.65]’:
> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> profile data is not flow-consistent
>  }
>  ^
> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> number of executions for edge 3-6 thought to be -81

This actually looks like corruption from concurrent updates (profiling is not thread
safe).  Do you get many more of these?
I can imagine that the garbage collector runs in parallel and often.
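
A small sketch of why lost updates can show up as negative edge counts, assuming a simplified model in which one exit edge of a block is counted directly and the other is derived by flow conservation. The function names and all numbers are made up for illustration; this is not GCC's profiling code.

```python
def racy_increment_pair(counter):
    """Two threads increment `counter` without synchronization.

    Worst-case interleaving: both load the old value before either
    stores, so one of the two increments is lost.
    """
    seen_by_a = counter          # thread A loads
    seen_by_b = counter          # thread B loads the same value
    counter = seen_by_a + 1      # thread A stores
    counter = seen_by_b + 1      # thread B overwrites A's store
    return counter


def derived_edge_count(block_count, measured_edge_count):
    """Flow conservation: a block's executions split across its two exits."""
    return block_count - measured_edge_count


def is_flow_consistent(block_count, measured_edge_count):
    """A derived count below zero means the counters contradict each other."""
    return derived_edge_count(block_count, measured_edge_count) >= 0


# Hypothetical run: the block really executed 200 times, but every pair
# of increments raced, so only 100 executions were recorded.
block_count = 0
for _ in range(100):
    block_count = racy_increment_pair(block_count)

# Assumption: the counter on one exit edge happened to lose no updates.
measured_edge_count = 120

print(block_count)                                            # 100
print(derived_edge_count(block_count, measured_edge_count))   # -20
print(is_flow_consistent(block_count, measured_edge_count))   # False
```

In this model the block counter undercounts, so the edge derived by subtraction goes negative, which is the shape of the "thought to be -81" messages quoted here.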
> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> number of executions for edge 3-4 thought to be 39667

Perhaps we should fix the dumping to print the full 64-bit value. :)
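
To illustrate the concern, here is a sketch of what happens when a 64-bit counter is printed through a 32-bit signed integer. Whether the dump really truncates this way is only what the remark above hints at; the helper name and the sample 64-bit values are hypothetical.

```python
def as_int32(value):
    """Reinterpret the low 32 bits of `value` as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


# Hypothetical 64-bit counters that would print as small numbers if truncated.
big_positive = (1 << 32) + 39667
looks_negative = (1 << 32) - 81

print(as_int32(big_positive))    # 39667
print(as_int32(looks_negative))  # -81
print(as_int32(12345))           # 12345 (values that fit are unchanged)
```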

Honza
Comment 191 Markus Trippelsdorf 2013-08-29 20:19:41 UTC
First of all many thanks for your work on reducing memory usage.
Peak memory usage is now lower (~3GB) than clang's (~4GB).

However, with --enable-optimize=-O3 on rev202079 I get:
(A default (-Os) build on rev202053 went fine this morning.)

/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN16nsInputStreamTee15WriteSegmentFunEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans24.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN16nsInputStreamTee15WriteSegmentFunEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: read-only segment has dynamic relocations
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpTransaction::ReadSegments(nsAHttpSegmentReader*, unsigned int, unsigned int*): error: undefined reference to 'nsHttpTransaction::ReadRequestSegment(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpConnection::OnSocketWritable(): error: undefined reference to 'nsHttpTransaction::ReadRequestSegment(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpPipeline::ReadSegments(nsAHttpSegmentReader*, unsigned int, unsigned int*): error: undefined reference to 'nsHttpPipeline::ReadFromPipe(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans1.ltrans.o:ccd3grW1.ltrans1.o:function imgRequest::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long, unsigned int): error: undefined reference to 'nsInputStreamTee::WriteSegmentFun(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans24.ltrans.o:ccd3grW1.ltrans24.o:function nsInputStreamTee::ReadSegments(tag_nsresult (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*): error: undefined reference to 'nsInputStreamTee::WriteSegmentFun(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'

Not sure if -O3 or rev202079 is to blame.
Comment 192 Markus Trippelsdorf 2013-08-29 21:51:17 UTC
It turned out that --enable-optimize=-O3 is the cause.
Rev202079 with -Os links fine.
Comment 193 Jan Hubicka 2013-09-03 14:38:59 UTC
I am building Firefox with -O3 and get no undefined symbols.  Can you please relink with -Wl,--no-demangle --save-temps -fdump-ipa-all and try to look up the missing symbol in the -lm.res file?  If it is not UNDEF there, please make the dumps available somewhere.
If it is undefined there, it may be a Firefox bug.
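
A minimal sketch of that lookup, assuming the resolution file is plain text with one symbol per line and the resolution kind as one of the whitespace-separated fields. The sample contents below (file name, indices, hashes) are hypothetical, and the column layout is an assumption; only the resolution names are taken from the linker plugin API. Linking with --no-demangle matters here because the file lists mangled names.

```python
# Resolution kinds from the linker plugin API, without the LDPR_ prefix.
RESOLUTIONS = {
    "UNKNOWN", "UNDEF", "PREVAILING_DEF", "PREVAILING_DEF_IRONLY",
    "PREEMPTED_REG", "PREEMPTED_IR", "RESOLVED_IR", "RESOLVED_EXEC",
    "RESOLVED_DYN", "PREVAILING_DEF_IRONLY_EXP",
}


def lookup_resolutions(res_text, mangled_name):
    """Return every resolution kind recorded for `mangled_name`."""
    found = []
    for line in res_text.splitlines():
        fields = line.split()
        if mangled_name in fields:
            found.extend(f for f in fields if f in RESOLUTIONS)
    return found


# Hypothetical excerpt; real files are produced by --save-temps.
SAMPLE = """\
1
example.o 2
190 0123456789abcdef PREVAILING_DEF_IRONLY _ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj
205 0123456789abcdef UNDEF memcpy
"""

symbol = "_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj"
print(lookup_resolutions(SAMPLE, symbol))      # ['PREVAILING_DEF_IRONLY']
print(lookup_resolutions(SAMPLE, "memcpy"))    # ['UNDEF']
print(lookup_resolutions(SAMPLE, "_Zmissing")) # []
```

An empty result corresponds to a symbol that does not appear in the file at all.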
Comment 194 Markus Trippelsdorf 2013-09-03 17:22:35 UTC
(In reply to Jan Hubicka from comment #193)
> I am building Firefox with -O3 and get no undefined symbols.  Can you
> please relink with -Wl,--no-demangle --save-temps -fdump-ipa-all and try to
> look up the missing symbol in the -lm.res file?  If it is not UNDEF there,
> please make the dumps available somewhere.
> If it is undefined there, it may be a Firefox bug.

Hmm, it's strange, because there are five undefined references;
one of them does not appear in lm.res at all and the other four 
are all PREVAILING_DEF_IRONLY.
(The whole dump is huge. Please tell me which part you need and
I will try to upload it somewhere.)
Comment 195 Jan Hubicka 2013-09-05 23:07:57 UTC
Today there were two fixes for bugs that produce undefined symbols like the ones you see.
Does the problem still exist on current mainline?  Are you using profile feedback?
Comment 196 Markus Trippelsdorf 2013-09-06 07:27:56 UTC
(In reply to Jan Hubicka from comment #195)
> Today there were two fixes for bugs that produce undefined symbols like the
> ones you see.
> Does the problem still exist on current mainline?  Are you using profile
> feedback?

The problem is gone on current mainline. (And yes I'm using profile feedback.)
Comment 197 Markus Trippelsdorf 2014-01-17 19:05:18 UTC
Created attachment 31876 [details]
mozilla-central patch
Comment 198 Markus Trippelsdorf 2014-01-17 19:06:39 UTC
Created attachment 31877 [details]
My local PGO/LTO script
Comment 199 Markus Trippelsdorf 2014-01-17 19:07:39 UTC
Created attachment 31878 [details]
.mozconfig_profile_gen
Comment 200 Martin Jambor 2014-03-06 17:08:00 UTC
I currently cannot build Firefox with LTO due to PR 60449 (yeah, I
know, using gcc configured with checking makes life hard, sometimes
unnecessarily).

I get errors like
 /home/mjambor/mozilla/mzc2/media/libvpx/vp8/encoder/onyx_if.c:4884:5: error: control flow  in the middle of basic block 7
Comment 201 Markus Trippelsdorf 2014-03-06 17:28:15 UTC
With current gcc trunk and mozilla-central trunk, Firefox crashes on startup when
built with -flto (--enable-optimize=-O3):

0x00007ffff5ce5d8f in nsCOMPtr_base::assign_with_AddRef(nsISupports*) [clone .constprop.13162] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
(gdb) bt
#0  0x00007ffff5ce5d8f in nsCOMPtr_base::assign_with_AddRef(nsISupports*) [clone .constprop.13162] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#1  0x00007ffff3fe60eb in nsSocketTransport::OnSocketDetached(PRFileDesc*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#2  0x00007ffff3eb74ac in nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext*, nsSocketTransportService::SocketContext*) ()
   from /var/tmp/moz-build-dir/dist/bin/libxul.so
#3  0x00007ffff3fff28f in nsSocketTransportService::Run() () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#4  0x00007ffff4059c6a in nsThread::ProcessNextEvent(bool, bool*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#5  0x00007ffff5ce5b39 in NS_ProcessNextEvent(nsIThread*, bool) [clone .constprop.13167] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#6  0x00007ffff45af7a0 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#7  0x00007ffff3ec649d in MessageLoop::Run() () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#8  0x00007ffff3fe7a56 in nsThread::ThreadFunc(void*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#9  0x00007ffff7e7757c in _pt_root () from /var/tmp/moz-build-dir/dist/bin/libnspr4.so
#10 0x00007ffff7bc41e2 in start_thread () from /lib/libpthread.so.0
#11 0x00007ffff74932ad in clone () from /lib/libc.so.6

When I build with PGO/LTO Firefox crashes later (when I close a
tab with e.g.: https://github.com/JuliaLang/julia/pull/6018 ):

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff51645ed in PL_DHashTableEnumerate(PLDHashTable*, PLDHashOperator (*)(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*), void*) ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
(gdb) bt
#0  0x00007ffff51645ed in PL_DHashTableEnumerate(PLDHashTable*, PLDHashOperator (*)(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*), void*) ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#1  0x00007ffff5754d32 in PresShell::Destroy() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#2  0x00007ffff5754831 in nsDocumentViewer::DestroyPresShell() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#3  0x00007ffff55ee5c4 in nsDocumentViewer::Hide() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#4  0x00007ffff57b72eb in nsDocShell::SetVisibility(bool) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#5  0x00007ffff5a589a4 in nsFrameLoader::Hide() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#6  0x00007ffff5a588f6 in nsHideViewer::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#7  0x00007ffff53b97de in nsContentUtils::RemoveScriptBlocker() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#8  0x00007ffff53cc954 in nsDocument::EndUpdate(unsigned int) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#9  0x00007ffff5651dd6 in mozilla::dom::XULDocument::EndUpdate(unsigned int) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#10 0x00007ffff549673b in nsINode::doRemoveChildAt(unsigned int, bool, nsIContent*, nsAttrAndChildArray&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#11 0x00007ffff5496085 in nsXULElement::RemoveChildAt(unsigned int, bool) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#12 0x00007ffff5494df9 in nsINode::RemoveChild(nsINode&, mozilla::ErrorResult&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#13 0x00007ffff5494a00 in mozilla::dom::NodeBinding::removeChild(JSContext*, JS::Handle<JSObject*>, nsINode*, JSJitMethodCallArgs const&) [clone .lto_priv.13709] ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#14 0x00007ffff53b01e7 in mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#15 0x00007ffff5262744 in js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#16 0x00007ffff524a14c in Interpret(JSContext*, js::RunState&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#17 0x00007ffff5249801 in js::RunScript(JSContext*, js::RunState&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#18 0x00007ffff52627ec in js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#19 0x00007ffff52a574c in js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#20 0x00007ffff55c553d in nsJSEventListener::HandleEvent(nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#21 0x00007ffff5869106 in nsXBLPrototypeHandler::ExecuteHandler(mozilla::dom::EventTarget*, nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#22 0x00007ffff5868554 in nsXBLEventHandler::HandleEvent(nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#23 0x00007ffff5402b6c in nsEventListenerManager::HandleEventInternal(nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent**, mozilla::dom::EventTarget*, nsEventStatus*) ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#24 0x00007ffff53c38b2 in nsEventTargetChainItem::HandleEventTargetChain(nsTArray<nsEventTargetChainItem>&, nsEventChainPostVisitor&, nsDispatchingCallback*, ELMCreationDetector&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#25 0x00007ffff53c1fe7 in nsEventDispatcher::Dispatch(nsISupports*, nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent*, nsEventStatus*, nsDispatchingCallback*, nsCOMArray<mozilla::dom::EventTarget>*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#26 0x00007ffff5a686c5 in nsTransitionManager::FlushTransitions(mozilla::css::CommonAnimationManager::FlushFlags) ()
   from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#27 0x00007ffff563309f in nsRefreshDriver::Tick(long, mozilla::TimeStamp) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#28 0x00007ffff56325ac in mozilla::RefreshDriverTimer::TimerTick(nsITimer*, void*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#29 0x00007ffff54a32f7 in nsTimerEvent::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#30 0x00007ffff5166651 in nsThread::ProcessNextEvent(bool, bool*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#31 0x00007ffff5627914 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#32 0x00007ffff5146183 in MessageLoop::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#33 0x00007ffff562770a in nsBaseAppShell::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#34 0x00007ffff56276be in nsAppStartup::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#35 0x00007ffff5136f58 in XRE_main () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so
#36 0x000000000040aa58 in do_main(int, char**, nsIFile*) [clone .lto_priv.18] ()
#37 0x000000000040a285 in main ()

A "vanilla" build without PGO or LTO runs fine.
Comment 202 H.J. Lu 2014-03-06 18:00:46 UTC
LTO miscompiles 435.gromacs in SPEC CPU 2006 on x32 with

-mx32 -O3 -funroll-loops -ffast-math

since r208165 (PR 60418).  Can you try r208163?
Comment 203 Markus Trippelsdorf 2014-03-06 19:06:00 UTC
(In reply to H.J. Lu from comment #202)
> LTO miscompiles 435.gromacs in SPEC CPU 2006 on x32 with
> 
> -mx32 -O3 -funroll-loops -ffast-math
> 
> since r208165 (PR 60418).  Can you try r208163?

Yes. Unfortunately with r208163 Firefox still crashes on startup.
Comment 204 Markus Trippelsdorf 2014-03-29 17:09:33 UTC
Here is a comparison of libxul sizes (in bytes, unstripped) for different
compiler options:

gcc (trunk):
-O3             90213016
-O3 -flto       79682648
-O3 -flto / PGO 77250512
-Os             70431584
-Os -flto       62474008

clang (trunk):
-O3             80574784
-O3 -flto       79394992
-Os             72452776
-Os -flto       65111640
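
The relative effect of -flto in the table above can be worked out directly. A short sketch using the byte counts as posted (the dictionary layout and function name are only illustrative):

```python
# (size without -flto, size with -flto), in bytes, from the table above
SIZES = {
    ("gcc", "-O3"): (90213016, 79682648),
    ("gcc", "-Os"): (70431584, 62474008),
    ("clang", "-O3"): (80574784, 79394992),
    ("clang", "-Os"): (72452776, 65111640),
}


def lto_saving_percent(without_lto, with_lto):
    """Size reduction from enabling -flto, as a percentage of the non-LTO size."""
    return 100.0 * (without_lto - with_lto) / without_lto


for (compiler, level), (plain, lto) in SIZES.items():
    print(f"{compiler} {level}: -flto saves {lto_saving_percent(plain, lto):.1f}%")
```

With these numbers, -flto shrinks the gcc builds by roughly 11 to 12 percent at both levels, while for clang the saving is about 1.5 percent at -O3 and about 10 percent at -Os.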
Comment 205 Jan Hubicka 2014-03-31 03:24:39 UTC
I was looking into this recently, too.  Curiously enough, for me clang+LTO was winning,
but comparing the symbols it seemed that the configure scripts picked a few more features
on the GCC side.  I looked briefly at the differences: we can optimize out more vtables
(I have a patch pending for the next stage1) and optimize out write-only global vars.
Still, the differences may be worth further investigation - clang seems to produce noticeably
fewer external relocations, too. This seems like an ABI bug on the clang side, though.

What I use for my Firefox builds is --param inline-unit-growth=5.  Our -O3 seems a bit
of overkill for an application of the size of Firefox...

Honza
Comment 206 Martin Liška 2014-04-02 16:24:49 UTC
Firefox (and Chromium) memory reports with -flto=9 and -O2; the archive also contains a memory usage graph:

https://docs.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit
Comment 207 Martin Liška 2014-04-02 16:25:53 UTC
Created attachment 32525 [details]
Memory usage graphs for -flto=9, -flto=4, -flto=1 with -O2
Comment 208 Markus Trippelsdorf 2014-04-08 08:13:28 UTC
Both issues from Comment 201 were fixed by:
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00338.html
Comment 209 Markus Trippelsdorf 2014-04-09 12:36:20 UTC
(In reply to Markus Trippelsdorf from comment #208)
> Both issues from Comment 201 were fixed by:
> http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00338.html

No, only the first issue is fixed. The second one (LTO/PGO build)
still happens unfortunately.
Comment 210 Steffen Hau 2014-05-23 13:48:52 UTC
The latest Firefox 29.0.1 does not compile with LTO enabled (Gentoo/GCC 4.9.0). It fails in elfhack:

make[5]: Entering directory '/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack'
elfhack
/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/_virtualenv/bin/python /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/config/expandlibs_exec.py --depend .deps/elfhack.pp --target elfhack -- x86_64-pc-linux-gnu-g++ -o elfhack -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -std=gnu++0x -MD -MP -MF .deps/elfhack.pp -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags host_elf.o host_elfhack.o  
x86_64-pc-linux-gnu-gcc -o dummy dummy.o -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text  -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin -Wl,-rpath-link,/usr/lib 
x86_64-pc-linux-gnu-g++  -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Werror=int-to-pointer-cast -Wtype-limits -Wempty-body -Wsign-compare -Wno-invalid-offsetof -Wcast-align -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -fno-strict-aliasing -fno-rtti -fno-math-errno -std=gnu++0x -pthread -pipe -fexceptions  -DNDEBUG -DTRIMMED -O2 -fomit-frame-pointer -fPIC -shared -Wl,-z,defs -Wl,-h,test-array.so -o test-array.so -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text  -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin -Wl,-rpath-link,/usr/lib  test-array.o -nostartfiles
x86_64-pc-linux-gnu-g++  -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Werror=int-to-pointer-cast -Wtype-limits -Wempty-body -Wsign-compare -Wno-invalid-offsetof -Wcast-align -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -fno-strict-aliasing -fno-rtti -fno-math-errno -std=gnu++0x -pthread -pipe -fexceptions  -DNDEBUG -DTRIMMED -O2 -fomit-frame-pointer -fPIC -shared -Wl,-z,defs -Wl,-h,test-ctors.so -o test-ctors.so -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text  -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin -Wl,-rpath-link,/usr/lib  test-ctors.o -nostartfiles
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
# Fail if the library doesn't have INIT .dynamic info
readelf -d test-ctors.so | grep '(INIT)'
 0x000000000000000c (INIT)               0x0
/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/elfhack -b -f test-ctors.so
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
# Fail if the library doesn't have INIT_ARRAY .dynamic info
test-ctors.so: Reduced by 12096 bytes
readelf -d test-array.so | grep '(INIT_ARRAY)'
# Fail if the backup file doesn't exist
[ -f 'test-ctors.so.bak' ]
 0x0000000000000019 (INIT_ARRAY)         0x9790
# Fail if the new library doesn't contain less relocations
/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/elfhack -b -f test-array.so
test-array.so: [ $(objdump -R test-ctors.so.bak | wc -l) -gt $(objdump -R test-ctors.so | wc -l) ]
Reduced by 12088 bytes
# Fail if the backup file doesn't exist
[ -f 'test-array.so.bak' ]
# Fail if the new library doesn't contain less relocations
[ $(objdump -R test-array.so.bak | wc -l) -gt $(objdump -R test-array.so | wc -l) ]
# Will either crash or return exit code 1 if elfhack is broken
LD_PRELOAD=/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/test-array.so /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/dummy
PASS
LD_PRELOAD=/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/test-ctors.so /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/dummy
FAIL
Makefile:52: recipe for target 'libs' failed
make[5]: *** [libs] Error 1
make[5]: Leaving directory '/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack'


Disabling LTO lets Firefox compile successfully.
Comment 211 Jan Hubicka 2014-05-24 21:47:37 UTC
Elfhack is rather sensitive to LTO, but it works for me, so this seems like a binutils issue or some elfhack change that happened recently.
I wrote instructions for building Firefox with LTO here:
http://hubicka.blogspot.ca/2014/04/linktime-optimization-in-gcc-2-firefox.html

Here is the -ftime-report output after the symtab hashtable was removed:
Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1536 kB ( 0%) ggc
 phase opt and generate  :  54.29 (58%) usr   1.28 (18%) sys  55.58 (50%) wall  720779 kB (18%) ggc
 phase stream in         :  33.54 (36%) usr   1.84 (26%) sys  35.39 (32%) wall 3389310 kB (82%) ggc
 phase stream out        :   6.00 ( 6%) usr   4.02 (56%) sys  19.99 (18%) wall       0 kB ( 0%) ggc
 garbage collection      :   1.86 ( 2%) usr   0.00 ( 0%) sys   1.86 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.23 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall       9 kB ( 0%) ggc
 ipa dead code removal   :   5.70 ( 6%) usr   0.18 ( 3%) sys   6.15 ( 6%) wall      92 kB ( 0%) ggc
 ipa inheritance graph   :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall     883 kB ( 0%) ggc
 ipa virtual call target :   5.58 ( 6%) usr   0.06 ( 1%) sys   5.32 ( 5%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall    9201 kB ( 0%) ggc
 ipa cp                  :   2.34 ( 2%) usr   0.21 ( 3%) sys   2.55 ( 2%) wall  223628 kB ( 5%) ggc
 ipa inlining heuristics :  26.97 (29%) usr   0.67 ( 9%) sys  27.66 (25%) wall  865791 kB (21%) ggc
 ipa comdats             :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   0.07 ( 0%) usr   0.11 ( 2%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.46 ( 0%) usr   0.19 ( 3%) sys   0.65 ( 1%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  24.76 (26%) usr   1.28 (18%) sys  26.08 (23%) wall 2571773 kB (63%) ggc
 ipa lto decl out        :   5.45 ( 6%) usr   0.28 ( 4%) sys   5.75 ( 5%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.13 ( 1%) usr   0.24 ( 3%) sys   1.38 ( 1%) wall  414551 kB (10%) ggc
 ipa lto decl merge      :   2.57 ( 3%) usr   0.01 ( 0%) sys   2.58 ( 2%) wall    8227 kB ( 0%) ggc
 ipa lto cgraph merge    :   1.72 ( 2%) usr   0.00 ( 0%) sys   1.72 ( 2%) wall   12166 kB ( 0%) ggc
 whopr wpa               :   1.04 ( 1%) usr   0.00 ( 0%) sys   1.04 ( 1%) wall       2 kB ( 0%) ggc
 whopr wpa I/O           :   0.03 ( 0%) usr   3.55 (50%) sys  13.51 (12%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   4.97 ( 5%) usr   0.06 ( 1%) sys   5.02 ( 5%) wall    3738 kB ( 0%) ggc
 ipa reference           :   3.62 ( 4%) usr   0.12 ( 2%) sys   3.75 ( 3%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.33 ( 0%) usr   0.01 ( 0%) sys   0.33 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.86 ( 4%) usr   0.01 ( 0%) sys   3.88 ( 3%) wall       0 kB ( 0%) ggc
 tree eh                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG cleanup        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.05 ( 0%) usr   0.16 ( 2%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.65 ( 1%) usr   0.00 ( 0%) sys   0.64 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :  93.84             7.14           110.98            4111626 kB

There are some improvements in devirtualization performance, which used quite a few decl->symbol lookups (about 20%).
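
A quick way to sanity-check a report like this is to parse it and confirm that the four "phase" rows add up to the TOTAL wall time. The sketch below embeds only a few rows copied from the report above; the regular expressions are my own guess at the column layout, not an official format description.

```python
import re

REPORT = """\
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1536 kB ( 0%) ggc
 phase opt and generate  :  54.29 (58%) usr   1.28 (18%) sys  55.58 (50%) wall  720779 kB (18%) ggc
 phase stream in         :  33.54 (36%) usr   1.84 (26%) sys  35.39 (32%) wall 3389310 kB (82%) ggc
 phase stream out        :   6.00 ( 6%) usr   4.02 (56%) sys  19.99 (18%) wall       0 kB ( 0%) ggc
 ipa inlining heuristics :  26.97 (29%) usr   0.67 ( 9%) sys  27.66 (25%) wall  865791 kB (21%) ggc
 ipa lto decl in         :  24.76 (26%) usr   1.28 (18%) sys  26.08 (23%) wall 2571773 kB (63%) ggc
 whopr wpa I/O           :   0.03 ( 0%) usr   3.55 (50%) sys  13.51 (12%) wall       0 kB ( 0%) ggc
 TOTAL                 :  93.84             7.14           110.98            4111626 kB
"""

# One timed row: name, then usr / sys / wall seconds, each followed by a percentage.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*"
    r"(?P<usr>[\d.]+) \(\s*\d+%\) usr\s+"
    r"(?P<sys>[\d.]+) \(\s*\d+%\) sys\s+"
    r"(?P<wall>[\d.]+) \(\s*\d+%\) wall"
)
# The TOTAL row has no percentages: usr, sys, wall.
TOTAL = re.compile(r"^\s*TOTAL\s*:\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)")


def parse(report):
    """Return ({row name: wall seconds}, total wall seconds)."""
    rows, total_wall = {}, None
    for line in report.splitlines():
        m = TOTAL.match(line)
        if m:
            total_wall = float(m.group(3))
            continue
        m = ROW.match(line)
        if m:
            rows[m.group("name")] = float(m.group("wall"))
    return rows, total_wall


rows, total_wall = parse(REPORT)
phase_wall = sum(v for k, v in rows.items() if k.startswith("phase "))
passes = {k: v for k, v in rows.items() if not k.startswith("phase ")}
slowest = max(passes, key=passes.get)

print(round(phase_wall, 2), total_wall)  # 110.98 110.98
print(slowest)                           # ipa inlining heuristics
```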
Comment 212 Steffen Hau 2014-05-26 09:47:28 UTC
Hi Jan,

I have binutils version 2.24 with the patch from Markus Trippelsdorf for early plugin loading, so I have no wrappers for ar, nm and ranlib. I've also symlinked liblto_plugin.so into the binutils bfd-plugins directory. I'll try to apply the 3 patches you mentioned in your blog post and see whether they help, but I think they are not relevant for the elfhack portion, which is failing on my system.

Which firefox version did you successfully compile?
Comment 213 Steffen Hau 2014-08-26 13:13:34 UTC
Hi Jan,

Just a short update: Firefox since version 30 and Thunderbird since version 31 both compile fine with LTO enabled, without the need for any additional patches. The package size was reduced by 51% (firefox ~420MB -> ~207MB) and 59% (thunderbird ~480MB -> ~200MB). Both programs work as intended, with no crashes or unexpected behaviour so far.

Best regards,
Steffen
Comment 214 Martin Liška 2014-11-13 16:25:22 UTC
I've just found an ICE for r217480 with LTO and -O2:

lto1: internal compiler error: in lto_output_node, at lto-cgraph.c:462
0x7ce411 lto_output_node
	../../gcc/lto-cgraph.c:462
0x7ce411 output_symtab()
	../../gcc/lto-cgraph.c:974
0x7db276 lto_output()
	../../gcc/lto-streamer-out.c:2309
0x814671 write_lto
	../../gcc/passes.c:2346
0x8177c1 ipa_write_optimization_summaries(lto_symtab_encoder_d*)
	../../gcc/passes.c:2545
0x59512a do_stream_out
	../../gcc/lto/lto.c:2475
0x59a41f stream_out
	../../gcc/lto/lto.c:2538
0x59a41f lto_wpa_write_files
	../../gcc/lto/lto.c:2655
0x59a41f do_whole_program_analysis
	../../gcc/lto/lto.c:3323
0x59a41f lto_main()
	../../gcc/lto/lto.c:3443

  if (tag == LTO_symtab_analyzed_node)
    gcc_assert (clone_of || !node->clone_of);
~~~~^
  if (!clone_of)
    streamer_write_hwi_stream (ob->main_stream, LCC_NOT_FOUND);
  else
    streamer_write_hwi_stream (ob->main_stream, ref);

If needed, I will try to reduce the objects that are part of the WPA phase.

Martin
Comment 215 Jan Hubicka 2015-01-19 23:58:51 UTC
Author: hubicka
Date: Mon Jan 19 23:58:19 2015
New Revision: 219871

URL: https://gcc.gnu.org/viewcvs?rev=219871&root=gcc&view=rev
Log:

	PR lto/45375
	* i386.c (gate): Check flag_expensive_optimizations and
	optimize_size.
	(ix86_option_override_internal): Drop optimize_size condition
	on MASK_ACCUMULATE_OUTGOING_ARGS, MASK_VZEROUPPER,
	MASK_AVX256_SPLIT_UNALIGNED_LOAD, MASK_AVX256_SPLIT_UNALIGNED_STORE,
	MASK_PREFER_AVX128.
	(ix86_avx256_split_vector_move_misalign,
	ix86_avx256_split_vector_move_misalign): Check optimize_insn_for_speed.
	* sse.md (all uses of TARGET_PREFER_AVX128): Add
	optimize_insn_for_speed_p check.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md
Comment 216 Jan Hubicka 2015-01-20 04:40:18 UTC
Author: hubicka
Date: Tue Jan 20 04:39:45 2015
New Revision: 219878

URL: https://gcc.gnu.org/viewcvs?rev=219878&root=gcc&view=rev
Log:

	PR lto/45375
	* i386.c (ix86_option_override_internal): Use ix86_tune_cost
	to set branch cost.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Comment 217 Jan Hubicka 2015-01-20 19:49:32 UTC
Author: hubicka
Date: Tue Jan 20 19:48:59 2015
New Revision: 219909

URL: https://gcc.gnu.org/viewcvs?rev=219909&root=gcc&view=rev
Log:

	PR lto/45375
	* ipa-inline.c: Include lto-streamer.h
	(report_inline_failed_reason): Output source file differences and
	flags on optimization/target node mismatch.
	(can_inline_edge_p): Consider caller to be the outer inline function;
	be less restrictive about matching opimize and optimize_size attributes.
	(inline_account_function_p): Break out from ...
	(inline_small_functions): ... here.
	* ipa-inline-transform.c (clone_inlined_nodes): Use
	inline_account_function_p.
	(inline_call): Use optimize attribution; use inline_account_function_p.
	(inline_transform): Use opt_for_fn.
	* ipa-inline.h (inline_account_function_p): Declare.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-transform.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
Comment 218 Martin Liška 2016-01-08 13:24:23 UTC
Hi.

Building Firefox revision:
commit a704d34fb1f9e0f5dbf4113298d885cdb650906c
Author: Matthew Noorenberghe <mozilla@noorenberghe.ca>
Date:   Thu Dec 3 17:33:35 2015 -0800

    Bug 1230391 - Disable password visibility toggling in the capture doorhanger outside Nightly. rs=bnicholson, a=lizzard on a CLOSED TREE
    
    --HG--
    extra : source : aea828e2cdf767a358ebc6ea661dd3b9b4160321
    extra : intermediate-source : 366dd290472633b06f0942d7737c34e942e0916a

This is a minimal set of LTO options for which the built binary can run:
MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-devirtualize"

For more details:
# MYFLAGS="$OPT -march=native -flto=9" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks -fno-devirtualize -fno-strict-aliasing" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks -fno-devirtualize" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-devirtualize" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-devirtualize" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse" FAILED
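
The claim that this set is minimal can be checked mechanically from the seven results above. A sketch, with the table transcribed by hand into a list of (extra flags, worked) pairs; the helper names are illustrative:

```python
from itertools import combinations

LDSE = "-fno-lifetime-dse"
DNPC = "-fno-delete-null-pointer-checks"
DEVIRT = "-fno-devirtualize"
SA = "-fno-strict-aliasing"

# Extra flags on top of "$OPT -march=native -flto=9", and whether the binary ran.
RESULTS = [
    (set(), False),
    ({LDSE, DNPC, DEVIRT, SA}, True),
    ({LDSE, DNPC}, False),
    ({LDSE, DNPC, DEVIRT}, True),
    ({DEVIRT}, False),
    ({LDSE, DEVIRT}, True),
    ({LDSE}, False),
]


def flags_in_every_working_build(results):
    """Flags present in all working configurations."""
    return set.intersection(*(flags for flags, ok in results if ok))


def smallest_working_set(results):
    return min((flags for flags, ok in results if ok), key=len)


def all_proper_subsets_failed(candidate, results):
    """True if every proper subset of `candidate` was tested and failed."""
    failed = [flags for flags, ok in results if not ok]
    for size in range(len(candidate)):
        for subset in combinations(sorted(candidate), size):
            if set(subset) not in failed:
                return False
    return True


best = smallest_working_set(RESULTS)
print(sorted(best))
print(flags_in_every_working_build(RESULTS) == best)  # True
print(all_proper_subsets_failed(best, RESULTS))       # True
```

Both checks agree with the summary: every working build contains -fno-lifetime-dse and -fno-devirtualize, and dropping either one was tested and failed.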

Martin
Comment 219 Jan Hubicka 2016-01-18 17:47:30 UTC
The devirtualization issue is now fixed, so we are down to -fno-lifetime-dse.
Comment 220 Martin Liška 2020-07-07 10:41:03 UTC
A comparison of Firefox and Chromium builds with LTO for GCC 9 and GCC 10 is here:
https://gist.github.com/marxin/223890df4d8d8e490b6b2918b77dacad

We have a serious regression in WPA time between GCC 9 and GCC 10.
Comment 221 Martin Liška 2020-07-07 11:09:57 UTC
For Chromium with GCC 10, the inliner starts after ~5 minutes, so it's very likely the inliner that takes so long.
Comment 222 Martin Liška 2020-07-07 11:16:00 UTC
(In reply to Martin Liška from comment #221)
> For Chromium with GCC 10, the inliner starts after ~5 minutes, so it's very
> likely the inliner that takes so long.

  45.07%  libc-2.31.so  [.] __memset_avx2_erms
  21.79%  [kernel]      [k] change_protection_range
   3.74%  lto1          [.] fibonacci_heap<sreal, cgraph_edge>::consolidate
   3.54%  lto1          [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node
   2.63%  [kernel]      [k] task_numa_work
Comment 223 Martin Liška 2020-07-25 12:23:20 UTC
(In reply to Martin Liška from comment #222)
> (In reply to Martin Liška from comment #221)
> > For the chromium with GCC 10, inliner starts after ~5 minutes, so it's very
> > likely inliner that takes so long.
> 
>   45.07%  libc-2.31.so  [.] __memset_avx2_erms
>   21.79%  [kernel]      [k] change_protection_range
>    3.74%  lto1          [.] fibonacci_heap<sreal, cgraph_edge>::consolidate
>    3.54%  lto1          [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node
>    2.63%  [kernel]      [k] task_numa_work

Suggested patch for it:
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550662.html
Comment 224 GCC Commits 2020-07-27 07:16:29 UTC
The master branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:7f5c0f328eced560a204bb8e3eae0d45795dd235

commit r11-2338-g7f5c0f328eced560a204bb8e3eae0d45795dd235
Author: Martin Liska <mliska@suse.cz>
Date:   Fri Jul 24 14:33:27 2020 +0200

    Use vec::reserve before vec_safe_grow_cleared is called
    
    gcc/ChangeLog:
    
            PR lto/45375
            * symbol-summary.h: Call vec_safe_reserve before grow is called
            in order to grow to a reasonable size.
            * vec.h (vec_safe_reserve): Add missing function for vl_ptr
            type.
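
To illustrate the idea behind the commit title, here is a minimal sketch of why reserving before growing can help. It assumes that growing without a prior reserve sizes the storage exactly, so a summary that grows by one entry per new symbol reallocates every time. `ToyVec` and its methods are hypothetical stand-ins written for this sketch, not the real vec.h API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical toy vector; it only models capacity handling.
struct ToyVec {
  int *data = nullptr;
  std::size_t len = 0;
  std::size_t cap = 0;
  std::size_t reallocs = 0;  // how many times the storage was reallocated

  ~ToyVec() { std::free(data); }

  // Make room for exactly n elements, with no slack.
  void reserve_exact(std::size_t n) {
    if (n <= cap) return;
    void *p = std::realloc(data, n * sizeof(int));
    if (!p) std::abort();
    data = static_cast<int *>(p);
    cap = n;
    ++reallocs;
  }

  // Make room for at least n elements, at least doubling the capacity.
  void reserve_geometric(std::size_t n) {
    if (n <= cap) return;
    std::size_t new_cap = cap ? cap * 2 : 4;
    if (new_cap < n) new_cap = n;
    reserve_exact(new_cap);
  }

  // Grow to n elements and zero the newly added tail.
  void grow_cleared(std::size_t n) {
    if (n <= len) return;
    reserve_exact(n);
    std::memset(data + len, 0, (n - len) * sizeof(int));
    len = n;
  }
};

// Grow one element at a time without reserving first.
std::size_t reallocs_without_reserve(std::size_t n) {
  ToyVec v;
  for (std::size_t id = 0; id < n; ++id) v.grow_cleared(id + 1);
  return v.reallocs;
}

// Same growth pattern, but reserve geometrically before each grow.
std::size_t reallocs_with_reserve(std::size_t n) {
  ToyVec v;
  for (std::size_t id = 0; id < n; ++id) {
    v.reserve_geometric(id + 1);
    v.grow_cleared(id + 1);
  }
  return v.reallocs;
}
```

In this toy model, growing to 1000 entries one at a time costs one reallocation per entry without the reserve, and only a handful with it. Whether that accounts for the whole WPA regression is not something this sketch can show.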
Comment 225 GCC Commits 2020-07-27 10:33:19 UTC
The releases/gcc-10 branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:f93ce9ea23e1806ccf9d8cd1640fc14596f54be8

commit r10-8537-gf93ce9ea23e1806ccf9d8cd1640fc14596f54be8
Author: Martin Liska <mliska@suse.cz>
Date:   Fri Jul 24 14:33:27 2020 +0200

    Use vec::reserve before vec_safe_grow_cleared is called
    
    gcc/ChangeLog:
    
            PR lto/45375
            * symbol-summary.h: Call vec_safe_reserve before grow is called
            in order to grow to a reasonable size.
            * vec.h (vec_safe_reserve): Add missing function for vl_ptr
            type.
    
    (cherry picked from commit 7f5c0f328eced560a204bb8e3eae0d45795dd235)