Bug 96913 - [10 regression] gcov TOPN streaming is incorrect for shared libraries
Summary: [10 regression] gcov TOPN streaming is incorrect for shared libraries
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: gcov-profile (show other bugs)
Version: 11.0
: P3 normal
Target Milestone: 10.3
Assignee: Sergei Trofimovich
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-03 07:39 UTC by Sergei Trofimovich
Modified: 2020-10-02 12:32 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work: 9.3.0
Known to fail: 10.2.0
Last reconfirmed: 2020-09-06 00:00:00


Attachments
_json.gcda (1.39 KB, application/octet-stream)
2020-09-03 07:41 UTC, Sergei Trofimovich
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sergei Trofimovich 2020-09-03 07:39:52 UTC
The hang happens on real tauthon-2.8.2 interpreter from PR96394 (no nice reproducer yet).

In this instance I tried to build tauthon-2.8.2 against gcc-master. It hangs early when tries to merge topn entry:

"""
#0  0x00007fd73865e0ce in __GI___libc_read (fd=3, buf=0x406284 <__gcov_var+36>, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007fd7385f0090 in __GI__IO_file_xsgetn (fp=0x1295bc0, data=<optimized out>, n=4096) at libioP.h:948
#2  0x00007fd7385e4d1f in __GI__IO_fread (buf=0x406284 <__gcov_var+36>, size=1, count=4096, fp=0x1295bc0) at iofread.c:38
#3  0x00007fd72ab4237e in gcov_read_words (words=2) at ../../../gcc/libgcc/../gcc/gcov-io.c:491
#4  0x00007fd72ab42483 in __gcov_read_counter () at ../../../gcc/libgcc/../gcc/gcov-io.c:528
#5  0x00007fd72ab4190e in gcov_get_counter_target () at ../../../gcc/libgcc/libgcov.h:383
#6  0x00007fd72ab41c6b in __gcov_merge_topn (counters=0x7fd72ab4d320 <__gcov4.encoder_clear>, n_counters=24) at ../../../gcc/libgcc/libgcov-merge.c:114
#7  0x00007fd72ab43569 in merge_one_data (
    filename=0x1154290 "/tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/build/temp.linux-x86_64-2.8/tmp/portage/dev-lang/tauthon-2.8.2/work/tauthon-2.8.2/Modules/_json.gcda", gi_ptr=0x7fd72ab4a180, summary=0x7ffde7d6e9c0) at ../../../gcc/libgcc/libgcov-driver.c:314
#8  0x00007fd72ab43b1a in dump_one_gcov (gi_ptr=0x7fd72ab4a180, gf=0x7ffde7d6ea00, run_counted=0, run_max=125)
    at ../../../gcc/libgcc/libgcov-driver.c:492
#9  0x00007fd72ab43cba in gcov_do_dump (list=0x7fd72ab4a180, run_counted=0) at ../../../gcc/libgcc/libgcov-driver.c:555
#10 0x00007fd72ab43d28 in __gcov_dump_one (root=0x7fd72ab4e5c0 <__gcov_root>) at ../../../gcc/libgcc/libgcov-driver.c:578
#11 0x00007fd72ab43d5d in __gcov_exit () at ../../../gcc/libgcc/libgcov-driver.c:600
#12 0x00007fd738c460a3 in _dl_fini () at dl-fini.c:139
#13 0x00007fd7385af564 in __run_exit_handlers (status=0, listp=0x7fd73872a738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
    run_dtors=run_dtors@entry=true) at exit.c:108
#14 0x00007fd7385af70a in __GI_exit (status=<optimized out>) at exit.c:139
#15 0x00007fd73899534e in Py_Exit () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#16 0x00007fd738995533 in handle_system_exit () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#17 0x00007fd73899588e in PyErr_PrintEx () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#18 0x00007fd738995e44 in PyErr_Print () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#19 0x00007fd7389b851e in RunModule () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#20 0x00007fd7389b9075 in Py_Main () from /tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/libtauthon2.8.so.1.0
#21 0x00000000004012ac in main ()
"""

Looks like the problem is in decoding of some tag in __gcov_merge_topn(). There count of entries is huge:

"""
(gdb) frame 6
#6  0x00007fd72ab41c6b in __gcov_merge_topn (counters=0x7fd72ab4d320 <__gcov4.encoder_clear>, n_counters=24) at ../../../gcc/libgcc/libgcov-merge.c:114
114               gcov_type value = gcov_get_counter_target ();
(gdb) list __gcov_merge_topn
96         -- counter
97         */
98
99      void
100     __gcov_merge_topn (gcov_type *counters, unsigned n_counters)
101     {
102       gcc_assert (!(n_counters % GCOV_TOPN_MEM_COUNTERS));
103
104       for (unsigned i = 0; i < (n_counters / GCOV_TOPN_MEM_COUNTERS); i++)
105         {
(gdb)
106           /* First value is number of total executions of the profiler.  */
107           gcov_type all = gcov_get_counter_ignore_scaling (-1);
108           gcov_type n = gcov_get_counter_ignore_scaling (-1);
109
110           counters[GCOV_TOPN_MEM_COUNTERS * i] += all;
111
112           for (unsigned j = 0; j < n; j++)
113             {
114               gcov_type value = gcov_get_counter_target ();
115               gcov_type count = gcov_get_counter_ignore_scaling (-1);
(gdb)
116
117               // TODO: we should use atomic here
118               gcov_topn_add_value (counters + GCOV_TOPN_MEM_COUNTERS * i, value,
119                                    count, 0, 0);
120             }
121         }
122     }
123     #endif /* L_gcov_merge_topn */
124
125     #endif /* inhibit_libc */
(gdb) print n
$1 = 140325305737200
(gdb) print j
$2 = 1896248771
"""

Compiler:

"""
$ ${HOME}/dev/git/gcc-native-quick-ggdb3/gcc/xgcc -B${HOME}/dev/git/gcc-native-quick-ggdb3/gcc -v
Reading specs from /home/slyfox/dev/git/gcc-native-quick-ggdb3/gcc/specs
COLLECT_GCC=/home/slyfox/dev/git/gcc-native-quick-ggdb3/gcc/xgcc
COLLECT_LTO_WRAPPER=/home/slyfox/dev/git/gcc-native-quick-ggdb3/gcc/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc/configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-languages=c,c++ --disable-bootstrap --with-multilib-list=m64 --prefix=/home/slyfox/dev/git/gcc-native-quick-ggdb3/../gcc-native-quick-installed-ggdb3 --disable-nls --without-isl --disable-libsanitizer --disable-libvtv --disable-libgomp --disable-libstdcxx-pch --disable-libunwind-exceptions CFLAGS='-O0 -ggdb3 ' CXXFLAGS='-O0 -ggdb3 ' --enable-valgrind-annotations
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.0.0 20200902 (experimental) (GCC)
"""

gcc-10 seems to be able to merge profiles.
Comment 1 Sergei Trofimovich 2020-09-03 07:41:27 UTC
Created attachment 49178 [details]
_json.gcda

Attaching _json.gcda in case it will shed the light if file was written incorrectly previously or it's a deserialization problem.
Comment 2 Sergei Trofimovich 2020-09-04 15:26:57 UTC
(In reply to Sergei Trofimovich from comment #0)
> The hang happens on real tauthon-2.8.2 interpreter from PR96394 (no nice
> reproducer yet).
> 
> In this instance I tried to build tauthon-2.8.2 against gcc-master. It hangs
> early when tries to merge topn entry:
> 
> """
> #0  0x00007fd73865e0ce in __GI___libc_read (fd=3, buf=0x406284
> <__gcov_var+36>, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:26
> #1  0x00007fd7385f0090 in __GI__IO_file_xsgetn (fp=0x1295bc0,
> data=<optimized out>, n=4096) at libioP.h:948
> #2  0x00007fd7385e4d1f in __GI__IO_fread (buf=0x406284 <__gcov_var+36>,
> size=1, count=4096, fp=0x1295bc0) at iofread.c:38
> #3  0x00007fd72ab4237e in gcov_read_words (words=2) at
> ../../../gcc/libgcc/../gcc/gcov-io.c:491
> #4  0x00007fd72ab42483 in __gcov_read_counter () at
> ../../../gcc/libgcc/../gcc/gcov-io.c:528
> #5  0x00007fd72ab4190e in gcov_get_counter_target () at
> ../../../gcc/libgcc/libgcov.h:383
> #6  0x00007fd72ab41c6b in __gcov_merge_topn (counters=0x7fd72ab4d320
> <__gcov4.encoder_clear>, n_counters=24) at
> ../../../gcc/libgcc/libgcov-merge.c:114
> #7  0x00007fd72ab43569 in merge_one_data (
>     filename=0x1154290
> "/tmp/portage/dev-lang/tauthon-2.8.2/work/x86_64-pc-linux-gnu/build/temp.
> linux-x86_64-2.8/tmp/portage/dev-lang/tauthon-2.8.2/work/tauthon-2.8.2/
> Modules/_json.gcda", gi_ptr=0x7fd72ab4a180, summary=0x7ffde7d6e9c0) at
> ../../../gcc/libgcc/libgcov-driver.c:314
> #8  0x00007fd72ab43b1a in dump_one_gcov (gi_ptr=0x7fd72ab4a180,
> gf=0x7ffde7d6ea00, run_counted=0, run_max=125)
>     at ../../../gcc/libgcc/libgcov-driver.c:492
> #9  0x00007fd72ab43cba in gcov_do_dump (list=0x7fd72ab4a180, run_counted=0)
> at ../../../gcc/libgcc/libgcov-driver.c:555
> #10 0x00007fd72ab43d28 in __gcov_dump_one (root=0x7fd72ab4e5c0
> <__gcov_root>) at ../../../gcc/libgcc/libgcov-driver.c:578
...
> 
> Looks like the problem is in decoding of some tag in __gcov_merge_topn().
> There count of entries is huge:
> 
> """
> (gdb) frame 6
> #6  0x00007fd72ab41c6b in __gcov_merge_topn (counters=0x7fd72ab4d320
> <__gcov4.encoder_clear>, n_counters=24) at
> ../../../gcc/libgcc/libgcov-merge.c:114
> 114               gcov_type value = gcov_get_counter_target ();
> (gdb) list __gcov_merge_topn
> 96         -- counter
> 97         */
> 98
> 99      void
> 100     __gcov_merge_topn (gcov_type *counters, unsigned n_counters)
> 101     {
> 102       gcc_assert (!(n_counters % GCOV_TOPN_MEM_COUNTERS));
> 103
> 104       for (unsigned i = 0; i < (n_counters / GCOV_TOPN_MEM_COUNTERS);
> i++)
> 105         {
> (gdb)
> 106           /* First value is number of total executions of the profiler. 
> */
> 107           gcov_type all = gcov_get_counter_ignore_scaling (-1);
> 108           gcov_type n = gcov_get_counter_ignore_scaling (-1);
> 109
> 110           counters[GCOV_TOPN_MEM_COUNTERS * i] += all;
> 111
> 112           for (unsigned j = 0; j < n; j++)
> 113             {
> 114               gcov_type value = gcov_get_counter_target ();
> 115               gcov_type count = gcov_get_counter_ignore_scaling (-1);
> (gdb)
> 116
> 117               // TODO: we should use atomic here
> 118               gcov_topn_add_value (counters + GCOV_TOPN_MEM_COUNTERS *
> i, value,
> 119                                    count, 0, 0);
> 120             }
> 121         }
> 122     }
> 123     #endif /* L_gcov_merge_topn */
> 124
> 125     #endif /* inhibit_libc */

> (gdb) print n
> $1 = 140325305737200

I looks like __gconv_merge_topn() is applied to 'indirect_call' counters contents. But it's content does not seem to match dynamic topn structure:

$ x86_64-pc-linux-gnu-gcov-dump -l _json.gcda

...
_json.gcda:    01a70000:   0:COUNTERS topn 0 counts
_json.gcda:    01a90000:  48:COUNTERS indirect_call 24 counts
_json.gcda:                   0: 1 1 140325305737168 1 1 140325305737200 0 0
_json.gcda:                   8: 0 0 0 0 0 0 0 0
_json.gcda:                  16: 0 0 0 0 0 0 0 0
...

Note how 140325305737200 is in the middle of topn.

My wild guess is that before

  commit 871e5ada6d53d5eb495cc9f323983f347487c1b2
  Author: Martin Liska <mliska@suse.cz>
  Date:   Fri Jan 31 13:10:14 2020 +0100

    Make TOPN counter dynamically allocated.

both indirect_call and topn had the same fixed n-value structure and were ~ok to be merged with __gconv_merge_topn().

But now topn got a special 0-values case (why did we emit it at all?) that merger can't handle and gets slightly past the beginning of 'indirect_call' section.
Comment 3 Sergei Trofimovich 2020-09-04 21:31:44 UTC
Specifically I think this is already a wrong format on disk:

> _json.gcda:    01a70000:   0:COUNTERS topn 0 counts
> _json.gcda:    01a90000:  48:COUNTERS indirect_call 24 counts
> _json.gcda:                   0: 1 1 140325305737168 1 1 140325305737200 0 0
> _json.gcda:                   8: 0 0 0 0 0 0 0 0
> _json.gcda:                  16: 0 0 0 0 0 0 0 0
> ...

Assuming indirect_call is in a 'hist' value format it should  be in form of:

  [total_executions, N, value1, counter1, ..., valueN, counterN]

Main problem: we have more than one entry here (which might be ok):
- record1 (ok):  total_executions=1 N=1 value1=140325305737168 counter1=1
- record2 (bad): total_executions=1 N=140325305737200 counter=0 ...

This is where we trip over enormous N.
Comment 4 Sergei Trofimovich 2020-09-06 10:23:16 UTC
(In reply to Sergei Trofimovich from comment #3)
> Specifically I think this is already a wrong format on disk:
> 
> > _json.gcda:    01a70000:   0:COUNTERS topn 0 counts
> > _json.gcda:    01a90000:  48:COUNTERS indirect_call 24 counts
> > _json.gcda:                   0: 1 1 140325305737168 1 1 140325305737200 0 0
> > _json.gcda:                   8: 0 0 0 0 0 0 0 0
> > _json.gcda:                  16: 0 0 0 0 0 0 0 0
> > ...
> 
> Assuming indirect_call is in a 'hist' value format it should  be in form of:
> 
>   [total_executions, N, value1, counter1, ..., valueN, counterN]
> 
> Main problem: we have more than one entry here (which might be ok):
> - record1 (ok):  total_executions=1 N=1 value1=140325305737168 counter1=1
> - record2 (bad): total_executions=1 N=140325305737200 counter=0 ...
> 
> This is where we trip over enormous N.

Found one of the causes of profiling data corruption. The bug happened earlier on initial serialization of indirect counters functions. The problematic code is:

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/libgcov-driver.c;h=58914268d4ece0b3a3a7dcb9cb21c4fa197fd770;hb=HEAD#l427

"""
 417       ci_ptr = gfi_ptr->ctrs;
 418       for (t_ix = 0; t_ix < GCOV_COUNTERS; t_ix++)
 419         {
 420           gcov_position_t n_counts;
 421 
 422           if (!gi_ptr->merge[t_ix])
 423             continue;
 424 
 425           n_counts = ci_ptr->num;
 426 
 427           if (gi_ptr->merge[t_ix] == __gcov_merge_topn)
 428             write_top_counters (ci_ptr, t_ix, n_counts);
 429           else
 430             {
 431               /* Do not stream when all counters are zero.  */
 432               int all_zeros = 1;
 433               for (unsigned i = 0; i < n_counts; i++)
 434                 if (ci_ptr->values[i] != 0)
 435                   {
 436                     all_zeros = 0;
 437                     break;
 438                   }
"""

The problematic line here is 'if (gi_ptr->merge[t_ix] == __gcov_merge_topn)'.

In case of tauthon '__gcov_merge_topn' is defined in two places:
1. in 'tauthon' binary itself
2. in any other shared library loaded by tauthon. Looks like 'libtauthon2.8.so.1.0' is the first one.

'__gcov_merge_topn' is defined as hidden symbol and gets resolved to local symbol:

$ x86_64-pc-linux-gnu-nm tauthon | fgrep gcov_merge_top
000000000040387f t __gcov_merge_topn
$ x86_64-pc-linux-gnu-nm libtauthon2.8.so.1.0 | fgrep gcov_merge_top
000000000029a202 t __gcov_merge_topn

Don't know yet know where 'gi_ptr->merge' gets filled in to leak executable's symbol into binary.
Comment 5 Sergei Trofimovich 2020-09-06 11:25:59 UTC
(In reply to Sergei Trofimovich from comment #4)

> $ x86_64-pc-linux-gnu-nm tauthon | fgrep gcov_merge_top
> 000000000040387f t __gcov_merge_topn
> $ x86_64-pc-linux-gnu-nm libtauthon2.8.so.1.0 | fgrep gcov_merge_top
> 000000000029a202 t __gcov_merge_topn
> 
> Don't know yet know where 'gi_ptr->merge' gets filled in to leak
> executable's symbol into binary.

Having looked at absolute addresses and address maps it's even more complicated that that:

call trace:

tauthon : main()
 -> libtauthon2.8.so.1.0 : execve()
    ->libtauthon2.8.so.1.0 : __gcov_execve()
      ->libtauthon2.8.so.1.0 : write_one_data()


There addresses of:
    if (gi_ptr->merge[t_ix] == __gcov_merge_topn)
are:
- gi_ptr->merge[t_ix]: _struct.so:&__gcov_merge_topn
- __gcov_merge_topn: libtauthon2.8.so.1.0:&__gcov_merge_topn

Sent https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553320.html for review.

Claiming the bug.
Comment 6 Sergei Trofimovich 2020-09-06 11:31:24 UTC
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=5f32f9cf13f99f6295591927950aaf98aa8dba91 is probably the first time regression was introduced (gcc10+).
Comment 7 CVS Commits 2020-09-22 10:42:10 UTC
The master branch has been updated by Sergei Trofimovich <slyfox@gcc.gnu.org>:

https://gcc.gnu.org/g:4ecf368f4b4223fb2df4f3887429dfbb48852e38

commit r11-3351-g4ecf368f4b4223fb2df4f3887429dfbb48852e38
Author: Sergei Trofimovich <siarheit@google.com>
Date:   Sun Sep 6 12:13:54 2020 +0100

    gcov: fix TOPN streaming from shared libraries
    
    Before the change gcc did not stream correctly TOPN counters
    if counters belonged to a non-local shared object.
    
    As a result zero-section optimization generated TOPN sections
    in a form not recognizable by '__gcov_merge_topn'.
    
    The problem happens because in a case of multiple shared objects
    '__gcov_merge_topn' function is present in address space multiple
    times (once per each object).
    
    The fix is to never rely on function address and predicate on TOPN
    counter types.
    
    libgcc/ChangeLog:
    
            PR gcov-profile/96913
            * libgcov-driver.c (write_one_data): Avoid function pointer
            comparison in TOP streaming decision.
Comment 8 Sergei Trofimovich 2020-09-22 10:44:27 UTC
Should be fixed for gcc-11. Will send a backport to gcc-10 this evening (the code changed slightly since gcc-10).
Comment 9 Sergei Trofimovich 2020-10-02 10:04:38 UTC
(In reply to Sergei Trofimovich from comment #8)
> Should be fixed for gcc-11. Will send a backport to gcc-10 this evening (the
> code changed slightly since gcc-10).

gcc-10 backport sent for review as https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554902.html
Comment 10 CVS Commits 2020-10-02 12:30:09 UTC
The releases/gcc-10 branch has been updated by Sergei Trofimovich <slyfox@gcc.gnu.org>:

https://gcc.gnu.org/g:b843d57c57ac92c9cd38eec7289da2f1ea696f18

commit r10-8844-gb843d57c57ac92c9cd38eec7289da2f1ea696f18
Author: Sergei Trofimovich <siarheit@google.com>
Date:   Sun Sep 6 12:13:54 2020 +0100

    gcov: fix TOPN streaming from shared libraries
    
    Before the change gcc did not stream correctly TOPN counters
    if counters belonged to a non-local shared object.
    
    As a result zero-section optimization generated TOPN sections
    in a form not recognizable by '__gcov_merge_topn'.
    
    The problem happens because in a case of multiple shared objects
    '__gcov_merge_topn' function is present in address space multiple
    times (once per each object).
    
    The fix is to never rely on function address and predicate on TOPN
    counter types.
    
    libgcc/ChangeLog:
    
            PR gcov-profile/96913
            * libgcov-driver.c (write_one_data): Avoid function pointer
            comparison in TOP streaming decision.
    
    (cherry picked from commit 4ecf368f4b4223fb2df4f3887429dfbb48852e38)
Comment 11 Sergei Trofimovich 2020-10-02 12:32:15 UTC
Should be fixed for both affected gcc-10 and gcc-11.