54487 – [4.8 Regression] profiledbootstrap broken by r190952

Bug 54487 - [4.8 Regression] profiledbootstrap broken by r190952

Summary: [4.8 Regression] profiledbootstrap broken by r190952

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	gcov-profile (show other bugs)
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	4.8.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	build

Depends on:
Blocks:

Reported:	2012-09-05 07:11 UTC by Markus Trippelsdorf
Modified:	2012-09-14 21:06 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2012-09-05 00:00:00

Attachments
bad .gcda file (3.40 KB, application/octet-stream) 2012-09-05 15:46 UTC, Markus Trippelsdorf	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Markus Trippelsdorf 2012-09-05 07:11:29 UTC

Breakage varies from bootstrap to bootstrap. Here is an example:
...
In file included from ../../gcc/gcc/config/i386/i386.c:40910:0:
./gt-i386.h: In function ‘ix86_set_reg_reg_cost(machine_mode)’:
./gt-i386.h:168:2: error: corrupted profile info: edge from 39 to 40 exceeds maximal count
 };
  ^
./gt-i386.h:168:2: error: corrupted profile info: edge from 40 to 41 exceeds maximal count
./gt-i386.h:168:2: error: corrupted profile info: edge from 41 to 43 exceeds maximal count
./gt-i386.h:168:2: error: corrupted profile info: edge from 42 to 43 exceeds maximal count
./gt-i386.h:168:2: error: corrupted profile info: profile data is not flow-consistent
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 30-42 thought to be -133660
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 30-31 thought to be 133660
./gt-i386.h:168:2: error: corrupted profile info: number of iterations for basic block 31 thought to be 267320
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 31-32 thought to be 133660
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 31-34 thought to be 133660
./gt-i386.h:168:2: error: corrupted profile info: number of iterations for basic block 34 thought to be 244050
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 34-35 thought to be 110390
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 34-38 thought to be 133660
./gt-i386.h:168:2: error: corrupted profile info: number of iterations for basic block 38 thought to be 226620
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 38-39 thought to be 92960
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 38-43 thought to be 133660
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 39-42 thought to be 92960
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 39-40 thought to be 0
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 40-42 thought to be 0
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 40-41 thought to be 0
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 41-42 thought to be 0
./gt-i386.h:168:2: error: corrupted profile info: number of executions for edge 41-43 thought to be 0
make[3]: *** [i386.o] Error 

Happens with (I'm a Gentoo ricer):
make -j4 BOOT_CFLAGS="-march=native -O3 -pipe" STAGE1_CFLAGS="-march=native -O3 -pipe" CFLAGS_FOR_TARGET="-march=native -O3 -pipe" CXXFLAGS_FOR_TARGET="-march=native -O3 -pipe" profiledbootstrap

(and
../gcc/configure --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --enable-lto --enable-nls --without-included-gettext --with-system-zlib --disable-werror --enable-initfini-array --with-gold --enable-secureplt --disable-multilib --enable-libmudflap --disable-libssp --disable-libgomp --enable-cld --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.8.0/python --enable-checking=release --disable-libgcj --enable-languages=c,c++ --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --with-build-config=bootstrap-lto --with-boot-ldflags="-Wl,-O1,--hash-style=gnu,--as-needed,--gc-sections,--icf=safe,--icf-iterations=3" --enable-version-specific-runtime-libs --disable-libstdcxx-pch --enable-libstdcxx-time=yes
)

Comment 1 Richard Biener 2012-09-05 09:18:41 UTC

CCing author of patch.

Comment 2 H.J. Lu 2012-09-05 12:40:59 UTC

Same here on x86-64:

http://gcc.gnu.org/ml/gcc-regression/2012-09/msg00080.html

Comment 3 H.J. Lu 2012-09-05 15:41:18 UTC

Also happens with revision 190982 on Fedora 18/x86-64.
I configured GCC with

--prefix=/usr/local --enable-clocale=gnu --with-system-zlib --enable-shared --with-demangler-in-ld --with-build-config=bootstrap-lto --with-fpmath=sse --enable-languages=c,c++,fortran,java,lto,objc

and used

make -j8 profiledbootstrap

on a 8-core machine.

Comment 4 Markus Trippelsdorf 2012-09-05 15:46:16 UTC

Created attachment 28133 [details]
bad .gcda file

I've attached a bad .gcda file.

Please note that both H.J. and I use --with-build-config=bootstrap-lto.

Comment 5 H.J. Lu 2012-09-05 16:17:01 UTC

I can reproduce it with only

--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-languages=c,c++  --prefix=/usr/local --enable-gnu-indirect-function --with-fpmath=sse

Comment 6 Teresa Johnson 2012-09-05 19:02:51 UTC

I finally got a reproducer for the error that H.J. reported. I will work on fixing that first. 

Markus, I looked at the gcda file you sent but don't see anything obviously wrong with it. gcov-dump reports that most of the counts in that file are 0 or very small, so I am not sure why there were messages about exceeding the maximal count.

Comment 7 Markus Trippelsdorf 2012-09-05 19:20:18 UTC

(In reply to comment #6)

> Markus, I looked at the gcda file you sent but don't see anything obviously
> wrong with it. gcov-dump reports that most of the counts in that file are 0 or
> very small, so I am not sure why there were messages about exceeding the
> maximal count.

Yes, as I wrote in my first message the point of failure always varies (for each bootstrap).
The attached cp-demangle.gcda was from a failure similar to the one H.J. 
reported in Comment 2.

I guess fixing H.J.'s issue will fix the other failures, too.

Comment 8 Teresa Johnson 2012-09-06 18:58:55 UTC

I think I have a solution for the issue that H.J. is encountering. Details below. Markus and H.J., would you be able to try the following patch to see if it addresses the failure you were seeing? Markus, were you only seeing failures when using a parallel make?

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 191035)
+++ libgcc/libgcov.c	(working copy)
@@ -707,7 +707,9 @@ gcov_exit (void)
 	    memcpy (cs_all, cs_prg, sizeof (*cs_all));
 	  else if (!all_prg.checksum
 		   && (!GCOV_LOCKED || cs_all->runs == cs_prg->runs)
-		   && memcmp (cs_all, cs_prg, sizeof (*cs_all)))
+                   && memcmp (cs_all, cs_prg,
+                              sizeof (*cs_all) - (sizeof (gcov_bucket_type)
+                                                  * GCOV_HISTOGRAM_SIZE)))
 	    {
 	      fprintf (stderr, "profiling:%s:Invocation mismatch - some data files may have been removed%s\n",
 		       gi_filename, GCOV_LOCKED


After looking at the cp-demangle matching issue for awhile, I finally realized that in my case at least, it was a valid issue with the preprocessed cp-demangle source code not matching the existing cp-demangle.gcda file. I tracked that down to different includes being done due to a difference in the libiberty configure. The libiberty config.log showed that there were some failures in some of the checks which were using the instrumented prev-gcc/xgcc that was giving errors like:

profiling:/home/tejohnson/extra/gcc_trunk_4_validate4/gcc/dwarf2out.gcda:Invocation mismatch - some data files may have been removed
configure:3427: $? = 0
configure: failed program was:
...

When profile merging happens, there are some sanity checks to ensure that the merged summaries for all object files are the same. These tests are failing in some cases due to small differences in the merged histograms, resulting in the above message. The total of the counter values in the histogram is the same, but there are slight differences in the cumulative counter values assigned to consecutive buckets. This could happen due to the way the cumulative counter values are apportioned out to the counters when merging. If the summaries are merged in different orders by the parallel runs, the integer division truncation may result in small differences. Ultimately these differences don't matter much as the sum of all the counter values saved in the histograms is consistent and the differences will be small and not significant. 

The best solution is to ignore the histogram when doing the sanity check, and just compare the high-level summary info (sum_all, sum_max, run_max, etc). That seems to be addressing the issue I was having. At least, I haven't been able to reproduce it yet.

Comment 9 Markus Trippelsdorf 2012-09-06 19:34:44 UTC

(In reply to comment #8)
> I think I have a solution for the issue that H.J. is encountering. Details
> below. Markus and H.J., would you be able to try the following patch to see if
> it addresses the failure you were seeing?

I just successfully completed a lto/profiledbootstrap with your patch applied.
Thanks Teresa.

> Markus, were you only seeing failures when using a parallel make?

I never use non-parallel make, so I can't tell.

Comment 10 Teresa Johnson 2012-09-06 20:02:30 UTC

That's good news. I will finish testing the patch and send it for review.

Thanks,
Teresa

Comment 11 H.J. Lu 2012-09-06 20:06:55 UTC

(In reply to comment #8)
> I think I have a solution for the issue that H.J. is encountering. Details
> below. Markus and H.J., would you be able to try the following patch to see if
> it addresses the failure you were seeing? Markus, were you only seeing failures
> when using a parallel make?
> 
> Index: libgcc/libgcov.c
> ===================================================================
> --- libgcc/libgcov.c    (revision 191035)
> +++ libgcc/libgcov.c    (working copy)
> @@ -707,7 +707,9 @@ gcov_exit (void)
>          memcpy (cs_all, cs_prg, sizeof (*cs_all));
>        else if (!all_prg.checksum
>             && (!GCOV_LOCKED || cs_all->runs == cs_prg->runs)
> -           && memcmp (cs_all, cs_prg, sizeof (*cs_all)))
> +                   && memcmp (cs_all, cs_prg,
> +                              sizeof (*cs_all) - (sizeof (gcov_bucket_type)
> +                                                  * GCOV_HISTOGRAM_SIZE)))
>          {
>            fprintf (stderr, "profiling:%s:Invocation mismatch - some data files
> may have been removed%s\n",
>                 gi_filename, GCOV_LOCKED
> 
> 

I applied it by hand:
diff --git a/libgcc/libgcov.c b/libgcc/libgcov.c
index fce8587..daf95af 100644
--- a/libgcc/libgcov.c
+++ b/libgcc/libgcov.c
@@ -707,7 +707,10 @@ gcov_exit (void)
       memcpy (cs_all, cs_prg, sizeof (*cs_all));
     else if (!all_prg.checksum
         && (!GCOV_LOCKED || cs_all->runs == cs_prg->runs)
-        && memcmp (cs_all, cs_prg, sizeof (*cs_all)))
+        && memcmp (cs_all, cs_prg, sizeof (*cs_all))
+        && memcmp (cs_all, cs_prg,
+              sizeof (*cs_all) - (sizeof (gcov_bucket_type)
+                   * GCOV_HISTOGRAM_SIZE)))
       {
         fprintf (stderr, "profiling:%s:Invocation mismatch - some data files may have been removed%s\n",
             gi_filename, GCOV_LOCKED

and it still failed for me on Fedora/18 24 core x86-64
with -j 12:

/export/gnu/import/git/gcc/gcc/tree-cfg.c: In function ‘bool gimple_can_merge_blocks_p(basic_block, basic_block)’:
/export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile info: edge from 52 to 53 exceeds maximal count
 }
 ^
/export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile info: edge from 54 to 55 exceeds maximal count
/export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile info: edge from 55 to 56 exceeds maximal count
/export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile info: edge from 56 to 57 exceeds maximal count
/export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: internal compiler error: in cgraph_create_edge_1, at cgraph.c:669
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make[4]: *** [tree-cfg.o] Error 1
make[4]: *** Waiting for unfinished jobs....
rm gcov.pod gcc.pod cpp.pod fsf-funding.pod gfdl.pod
make[4]: Leaving directory `/export/build/gnu/gcc-fdo/build-x86_64-linux/gcc'
make[3]: *** [all-stagefeedback-gcc] Error 2

Comment 12 Teresa Johnson 2012-09-06 20:23:50 UTC

On Thu, Sep 6, 2012 at 1:06 PM, hjl.tools at gmail dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>
> --- Comment #11 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-06 20:06:55 UTC ---
> (In reply to comment #8)
>> I think I have a solution for the issue that H.J. is encountering. Details
>> below. Markus and H.J., would you be able to try the following patch to see if
>> it addresses the failure you were seeing? Markus, were you only seeing failures
>> when using a parallel make?
>>
>> Index: libgcc/libgcov.c
>> ===================================================================
>> --- libgcc/libgcov.c    (revision 191035)
>> +++ libgcc/libgcov.c    (working copy)
>> @@ -707,7 +707,9 @@ gcov_exit (void)
>>          memcpy (cs_all, cs_prg, sizeof (*cs_all));
>>        else if (!all_prg.checksum
>>             && (!GCOV_LOCKED || cs_all->runs == cs_prg->runs)
>> -           && memcmp (cs_all, cs_prg, sizeof (*cs_all)))
>> +                   && memcmp (cs_all, cs_prg,
>> +                              sizeof (*cs_all) - (sizeof (gcov_bucket_type)
>> +                                                  * GCOV_HISTOGRAM_SIZE)))
>>          {
>>            fprintf (stderr, "profiling:%s:Invocation mismatch - some data files
>> may have been removed%s\n",
>>                 gi_filename, GCOV_LOCKED
>>
>>
>
> I applied it by hand:
> diff --git a/libgcc/libgcov.c b/libgcc/libgcov.c
> index fce8587..daf95af 100644
> --- a/libgcc/libgcov.c
> +++ b/libgcc/libgcov.c
> @@ -707,7 +707,10 @@ gcov_exit (void)
>        memcpy (cs_all, cs_prg, sizeof (*cs_all));
>      else if (!all_prg.checksum
>          && (!GCOV_LOCKED || cs_all->runs == cs_prg->runs)
> -        && memcmp (cs_all, cs_prg, sizeof (*cs_all)))
> +        && memcmp (cs_all, cs_prg, sizeof (*cs_all))
> +        && memcmp (cs_all, cs_prg,
> +              sizeof (*cs_all) - (sizeof (gcov_bucket_type)
> +                   * GCOV_HISTOGRAM_SIZE)))
>        {
>          fprintf (stderr, "profiling:%s:Invocation mismatch - some data files
> may have been removed%s\n",
>              gi_filename, GCOV_LOCKED
>
> and it still failed for me on Fedora/18 24 core x86-64
> with -j 12:
>
> /export/gnu/import/git/gcc/gcc/tree-cfg.c: In function ‘bool
> gimple_can_merge_blocks_p(basic_block, basic_block)’:
> /export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile
> info: edge from 52 to 53 exceeds maximal count
>  }
>  ^
> /export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile
> info: edge from 54 to 55 exceeds maximal count
> /export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile
> info: edge from 55 to 56 exceeds maximal count
> /export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: error: corrupted profile
> info: edge from 56 to 57 exceeds maximal count
> /export/gnu/import/git/gcc/gcc/tree-cfg.c:7904:1: internal compiler error: in
> cgraph_create_edge_1, at cgraph.c:669

Unfortunately I haven't been able to reproduce this one yet. Can you
send me the corrupt gcda file?

Thanks,
Teresa

> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <http://gcc.gnu.org/bugs.html> for instructions.
> make[4]: *** [tree-cfg.o] Error 1
> make[4]: *** Waiting for unfinished jobs....
> rm gcov.pod gcc.pod cpp.pod fsf-funding.pod gfdl.pod
> make[4]: Leaving directory `/export/build/gnu/gcc-fdo/build-x86_64-linux/gcc'
> make[3]: *** [all-stagefeedback-gcc] Error 2
>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

Comment 13 H.J. Lu 2012-09-06 20:49:02 UTC

It works for me now after syncing with revision 191037.

Comment 14 Teresa Johnson 2012-09-07 05:19:10 UTC

On Thu, Sep 6, 2012 at 1:49 PM, hjl.tools at gmail dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>
> --- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-06 20:49:02 UTC ---
> It works for me now after syncing with revision 191037.

Unfortunately, I have now hit this myself a couple times in further
testing, so I am still digging...

Teresa

>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

Comment 15 tejohnson 2012-09-07 13:49:54 UTC

Author: tejohnson
Date: Fri Sep  7 13:49:47 2012
New Revision: 191074

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191074
Log:
This fixes part of the issue described in PR gcov-profile/54487 where
there were warnings about mismatches due to slight differences in the
merged histograms in different object files. This can happen due to
the truncating integer division in the merge routine, which could result
in slightly different histograms when summaries are merged in different
orders.

2012-09-07  Teresa Johnson  <tejohnson@google.com>

        PR gcov-profile/54487
	* libgcc/libgcov.c (gcov_exit): Avoid warning on histogram
        differences.

Modified:
    trunk/libgcc/ChangeLog
    trunk/libgcc/libgcov.c

Comment 16 Teresa Johnson 2012-09-11 17:24:58 UTC

On Thu, Sep 6, 2012 at 10:18 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Thu, Sep 6, 2012 at 1:49 PM, hjl.tools at gmail dot com
> <gcc-bugzilla@gcc.gnu.org> wrote:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>>
>> --- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-06 20:49:02 UTC ---
>> It works for me now after syncing with revision 191037.
>
> Unfortunately, I have now hit this myself a couple times in further
> testing, so I am still digging...

After digging into this for several days now, I am convinced that the
gcda file is changing out from underneath the profile-use compile
during the read. I just don't understand how, since fcntl is being
used to lock the file. Details below. I am hoping someone has some
ideas or advice on where to go.

First, some info on the compile step and gcda file locking, then I
will explain why I belive the gcda file is being written mid-read.

The failure is occurring sporadically during the profile-use step of a
parallel profiledbootstrap. In this step, not only are the gcc
profiles being read in by the profile-use compilation, but the gcc
binaries being used in the compile are instrumented, so the gcda files
are also being written by the parallel build processes. However, the
fcntl file locking should prevent interference between the reads and
the writes of the same gcda file. According to the fcntl
documentation, the lock is lost on any fclose on the file by the same
process, even if it uses a different file descriptor. But for a given
profile use compile, the read step should be complete before the write
step when the gcc exits and updates the gcda files with libgcov. So I
don't think the lock should be lost this way.

The failure is not reproducible manually. Therefore, I have been
adding instrumentation to the compiler to spit out information at both
the point of failure ("error: corrupted profile info: edge from 54 to
55 exceeds maximal count" from read_profile_edge_counts()), and then
at the point when the gcda counts are being read in by
read_counts_file(). This is the loop that looks like:

          for (ix = 0; ix != n_counts; ix++)
            entry->counts[ix] += gcov_read_counter ();

I modified the above loop to check each count as it is read in against
the sum_max, and if it exceeds sum_max to print the summary and
counters at that point. I then compared this to the counters obtained
for that function from a gcov-dump of the gcda file afterwards. The
counters in the gcda file from gcov-dump are slightly higher because
of subsequent merges into it by other parallel builds before
compilation aborts, but it is still fairly easy to correlate the
counter values between them.

What I found is that the counter values being read by read_counts_file
look good up to a certain point, and then they go bad. Looking at the
huge bad values in hex, they are actually valid values from the gcda
file, but it looks like we suddenly jumped back or ahead several
locations. So for example, in some cases where it looks like we
suddenly jumped ahead in the gcda file, I see that some of the last
counter values being read into the array as counters are actually the
tag values from just after the counter array in the gcda file. Or in
some cases, the counter values are being read mis-aligned by one word,
so they look huge because instead of having 0x00000000 in the most
significant of the 2 counter words, the 0x00000000 is in the least
significant of the 2 counter words. In other words, we jumped some odd
number of words ahead (or behind) in the gcda file.

If another process is writing into the gcda file at the same time,
this could happen. Specifically, since the histogram now included in
the gcda file summaries with my patch only write non-zero values,
after a merge the size of the histogram, and therefore the size of the
summaries, could change. This will cause the starting offsets of
different tags/sections throughout the rest of the gcda file to shift.
This shouldn't be a problem if the file locking is working though. But
if the file locking is not working, then that would explain why
suddenly during the read we suddenly seem to jump to a different spot.

A couple of other notes:
- I added some code after each counter read in the above loop to seek
back to the offset where we read the tag, re-read it, and compare it
to the tag we read earlier (then re-seek back to the current location
in the counter array before continuing the read). Normally this check
succeeds, but in the cases where I am hitting the above errors, the
"tag" at that location has changed to something that doesn't look like
a tag.
- Jumping to a different spot should corrupt the reads of the rest of
the file. But the main loop in read_counts_file will simply ignore any
tag it doesn't understand, and exits when it reads a '0' tag. That's
why there was no error being flagged during the read when this was
happening.
- If something like this was happening on trunk before my patch, it
most likely would have been hidden because the size of the summaries
was static, and therefore so were the offsets within the file, except
in the case when a new summary was being added during the merge (which
is not very often). So it is possible that this problem existed prior
but was hidden.

Any suggestions or ideas?

Thanks,
Teresa

>
> Teresa
>
>>
>> --
>> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
>> ------- You are receiving this mail because: -------
>> You are on the CC list for the bug.
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comment 17 H.J. Lu 2012-09-11 17:29:15 UTC

Thanks for looking into it.  This is a long standing problem.
I have seen random profiledbootstrap failures for a long time.

Comment 18 Teresa Johnson 2012-09-11 17:39:00 UTC

On Tue, Sep 11, 2012 at 10:29 AM, hjl.tools at gmail dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-11 17:29:15 UTC ---
> Thanks for looking into it.  This is a long standing problem.
> I have seen random profiledbootstrap failures for a long time.

Thanks for confirming that this has happened prior. Unfortunately the
addition of the histogram is likely making this more frequent, due to
the changing summary sizes after merging. One way to deal with this
for now might be to write all histogram entries (even the 0 ones) into
the summary to keep the size static.

Honza, what do you think?

Teresa

>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

Comment 19 davidxl 2012-09-11 17:44:29 UTC

How much saving do we get by not writing out the 0 entries? With the
proposed change, how less frequent is the problem occuring?

David

On Tue, Sep 11, 2012 at 10:38 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 11, 2012 at 10:29 AM, hjl.tools at gmail dot com
> <gcc-bugzilla@gcc.gnu.org> wrote:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>>
>> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-11 17:29:15 UTC ---
>> Thanks for looking into it.  This is a long standing problem.
>> I have seen random profiledbootstrap failures for a long time.
>
> Thanks for confirming that this has happened prior. Unfortunately the
> addition of the histogram is likely making this more frequent, due to
> the changing summary sizes after merging. One way to deal with this
> for now might be to write all histogram entries (even the 0 ones) into
> the summary to keep the size static.
>
> Honza, what do you think?
>
> Teresa
>
>>
>> --
>> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
>> ------- You are receiving this mail because: -------
>> You are on the CC list for the bug.
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comment 20 Teresa Johnson 2012-09-11 18:05:13 UTC

On Tue, Sep 11, 2012 at 10:44 AM, Xinliang David Li <davidxl@google.com> wrote:
> How much saving do we get by not writing out the 0 entries? With the
> proposed change, how less frequent is the problem occuring?

Let me get back with some stats. Each histogram entry requires 5
words, and there are a max of 252 entries. In the few cases I checked
just now, we were printing about 60 entries per summary, with 3
summaries per gcda file. So printing the whole thing in these cases
would require 5*(252-60)*3 = 2880 extra words, or 11520 bytes.
Unfortunately, that is a significant increase over the current sizes
of those files, which are currently only double or triple that.

I also need to verify that changing this would reduce the frequency.

A couple other possibilities since this is not frequent:
- change the existing error to a warning (as we do under
flag_profile_correction)
- after finishing reading the counts, re-read the tag as I am doing in
my debugging, and if it is no longer valid, throw everything away and
re-read the file.
- check the counters after reading each one, and if it is > sum_max,
ignore it and abort the profile read with a warning but continue
compiling.

Obviously the best solution would be to figure out how the lock is
being lost/ignored and fix that, but that may take some time.

Teresa

>
> David
>
> On Tue, Sep 11, 2012 at 10:38 AM, Teresa Johnson <tejohnson@google.com> wrote:
>> On Tue, Sep 11, 2012 at 10:29 AM, hjl.tools at gmail dot com
>> <gcc-bugzilla@gcc.gnu.org> wrote:
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>>>
>>> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-11 17:29:15 UTC ---
>>> Thanks for looking into it.  This is a long standing problem.
>>> I have seen random profiledbootstrap failures for a long time.
>>
>> Thanks for confirming that this has happened prior. Unfortunately the
>> addition of the histogram is likely making this more frequent, due to
>> the changing summary sizes after merging. One way to deal with this
>> for now might be to write all histogram entries (even the 0 ones) into
>> the summary to keep the size static.
>>
>> Honza, what do you think?
>>
>> Teresa
>>
>>>
>>> --
>>> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
>>> ------- You are receiving this mail because: -------
>>> You are on the CC list for the bug.
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comment 21 davidxl 2012-09-11 18:08:26 UTC

Assuming the size of histogram for the same file does not vary that
much, is it better to round the size to the next power of 2 -- 60
entries will need print out 64 etc?

David

On Tue, Sep 11, 2012 at 11:04 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 11, 2012 at 10:44 AM, Xinliang David Li <davidxl@google.com> wrote:
>> How much saving do we get by not writing out the 0 entries? With the
>> proposed change, how less frequent is the problem occuring?
>
> Let me get back with some stats. Each histogram entry requires 5
> words, and there are a max of 252 entries. In the few cases I checked
> just now, we were printing about 60 entries per summary, with 3
> summaries per gcda file. So printing the whole thing in these cases
> would require 5*(252-60)*3 = 2880 extra words, or 11520 bytes.
> Unfortunately, that is a significant increase over the current sizes
> of those files, which are currently only double or triple that.
>
> I also need to verify that changing this would reduce the frequency.
>
> A couple other possibilities since this is not frequent:
> - change the existing error to a warning (as we do under
> flag_profile_correction)
> - after finishing reading the counts, re-read the tag as I am doing in
> my debugging, and if it is no longer valid, throw everything away and
> re-read the file.
> - check the counters after reading each one, and if it is > sum_max,
> ignore it and abort the profile read with a warning but continue
> compiling.
>
> Obviously the best solution would be to figure out how the lock is
> being lost/ignored and fix that, but that may take some time.
>
> Teresa
>
>>
>> David
>>
>> On Tue, Sep 11, 2012 at 10:38 AM, Teresa Johnson <tejohnson@google.com> wrote:
>>> On Tue, Sep 11, 2012 at 10:29 AM, hjl.tools at gmail dot com
>>> <gcc-bugzilla@gcc.gnu.org> wrote:
>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>>>>
>>>> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-09-11 17:29:15 UTC ---
>>>> Thanks for looking into it.  This is a long standing problem.
>>>> I have seen random profiledbootstrap failures for a long time.
>>>
>>> Thanks for confirming that this has happened prior. Unfortunately the
>>> addition of the histogram is likely making this more frequent, due to
>>> the changing summary sizes after merging. One way to deal with this
>>> for now might be to write all histogram entries (even the 0 ones) into
>>> the summary to keep the size static.
>>>
>>> Honza, what do you think?
>>>
>>> Teresa
>>>
>>>>
>>>> --
>>>> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
>>>> ------- You are receiving this mail because: -------
>>>> You are on the CC list for the bug.
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comment 22 H.J. Lu 2012-09-11 18:10:55 UTC

(In reply to comment #20)
>
> Obviously the best solution would be to figure out how the lock is
> being lost/ignored and fix that, but that may take some time.
> 

Can we use a lockfile to verify that fcntl lock is working
correctly?

Comment 23 Markus Trippelsdorf 2012-09-11 18:14:52 UTC

gcc/gcov-io.h has:
#if defined (HOST_HAS_F_SETLKW)
#define GCOV_LOCKED 1
#else
#define GCOV_LOCKED 0
#endif

But HOST_HAS_F_SETLKW isn't defined anywhere else AFAICS:
gcc % git grep HOST_HAS_F_SETLKW
gcc/gcov-io.h:#if defined (HOST_HAS_F_SETLKW)
gcc %

Comment 24 Teresa Johnson 2012-09-11 18:57:05 UTC

On Tue, Sep 11, 2012 at 11:14 AM, markus at trippelsdorf dot de
<gcc-bugzilla@gcc.gnu.org> wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>
> --- Comment #23 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-09-11 18:14:52 UTC ---
> gcc/gcov-io.h has:
> #if defined (HOST_HAS_F_SETLKW)
> #define GCOV_LOCKED 1
> #else
> #define GCOV_LOCKED 0
> #endif
>
> But HOST_HAS_F_SETLKW isn't defined anywhere else AFAICS:
> gcc % git grep HOST_HAS_F_SETLKW
> gcc/gcov-io.h:#if defined (HOST_HAS_F_SETLKW)
> gcc %

Maybe it is as simple as that?! I thought I saw that GCOV_LOCKED was
set for my compile, but that may have been on the libgcov compile.

In fact, just above the code Markus shows from gcov-io.h, when
IN_LIBGCOV, GCOV_LOCKED is set based on TARGET_POSIX_IO:

#if defined (TARGET_POSIX_IO)
#define GCOV_LOCKED 1
#else
#define GCOV_LOCKED 0
#endif

Indeed, when I look at the preprocessed libgcov.c output from its
compile command, the GCOV_LOCKED is clearly set (by looking at the
preprocessed gcov_open() code).

But when I use the compile command for coverage.c, which includes
gcov-io.c but is !IN_LIBGCOV (so GCOV_LOCKED is set based on
HOST_HAS_F_SETLKW), the preprocessed gcov_open code is that of a
!GCOV_LOCKED compile, without the call to fcntl.

So perhaps it is just the case that the libgcov code is that writes
the gcda files is doing the locking, but the read on profile-use is
not!

Anyone know how HOST_HAS_F_SETLKW was supposed to be set? I do see
that my configure is setting HAVE_FCNTL_H, perhaps that was intended?

Teresa

>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

Comment 25 Jakub Jelinek 2012-09-11 18:58:59 UTC

Indeed, seems http://gcc.gnu.org/ml/gcc-patches/2003-05/msg00571.html
has introduced use of that macro, but didn't bother to actually define it anywhere.

Comment 26 Jakub Jelinek 2012-09-11 19:04:47 UTC

For the check, I guess you want to check that you can actually compile on host something like:
#include <fcntl.h>

int
main ()
{
  struct flock fl;
  fl.l_whence = 0;
  fl.l_start = 0;
  fl.l_len = 0;
  fl.l_pid = 0;
  return fcntl (1, F_SETLKW, &fl);
}

Comment 27 Teresa Johnson 2012-09-11 19:08:07 UTC

Thanks for the pointers, Jakub. I'll work on adding this check.
Teresa

On Tue, Sep 11, 2012 at 12:04 PM, jakub at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487
>
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
>
> --- Comment #26 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-09-11 19:04:47 UTC ---
> For the check, I guess you want to check that you can actually compile on host
> something like:
> #include <fcntl.h>
>
> int
> main ()
> {
>   struct flock fl;
>   fl.l_whence = 0;
>   fl.l_start = 0;
>   fl.l_len = 0;
>   fl.l_pid = 0;
>   return fcntl (1, F_SETLKW, &fl);
> }
>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

Comment 28 tejohnson 2012-09-13 04:59:18 UTC

Author: tejohnson
Date: Thu Sep 13 04:59:14 2012
New Revision: 191238

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191238
Log:
This fixes PR gcov-profile/54487 where the gcda files were not locked
by the profile-use read, enabling writes by other instrumented compiles
to change the profile in the middle of the profile use read. The GCOV_LOCKED
macro was not set because it was guarded by HOST_HAS_F_SETLKW, which was
never set. The fix is to add a compile test in the configure to set it.

2012-09-12  Teresa Johnson  <tejohnson@google.com>

	PR gcov-profile/54487
	* configure.ac (HOST_HAS_F_SETLKW): Set based on compile
	test using F_SETLKW with fcntl.
	* configure, config.in: Regenerate.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config.in
    trunk/gcc/configure
    trunk/gcc/configure.ac

Comment 29 Markus Trippelsdorf 2012-09-13 06:39:02 UTC

Fixed. Thanks all.

Comment 30 tejohnson 2012-09-13 13:32:45 UTC

Author: tejohnson
Date: Thu Sep 13 13:32:31 2012
New Revision: 191254

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191254
Log:
2012-09-12  Teresa Johnson  <tejohnson@google.com>

	Backport from mainline.
	2012-09-12  Teresa Johnson  <tejohnson@google.com>

        PR gcov-profile/54487
        * configure.ac (HOST_HAS_F_SETLKW): Set based on compile
        test using F_SETLKW with fcntl.
        * configure, config.in: Regenerate.

Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/config.in
    branches/gcc-4_7-branch/gcc/configure
    branches/gcc-4_7-branch/gcc/configure.ac

Comment 31 Jan Hubicka 2012-09-13 16:17:10 UTC

Huh, 9 years for sucha nonsence is quite impressive. Thanks for tracking it down! :)

Comment 32 tejohnson 2012-09-14 21:06:54 UTC

Author: tejohnson
Date: Fri Sep 14 21:06:49 2012
New Revision: 191312

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191312
Log:
Backport from trunk r190952 to add counter histogram to gcov program summary,
and follow-on fixes for PR gcov-profile/54487 (r191074 and r191238).

2012-09-14  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (gcov_histogram_insert): New function.
	(gcov_compute_histogram): Ditto.
	(sort_by_reverse_gcov_value): Remove function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_merge_gcda_file): Merge histogram while merging summary.
	(gcov_gcda_file_size): Include histogram in summary size computation.
	(gcov_write_gcda_file): Remove assert that is no longer valid.
	(gcov_exit_init): Invoke gcov_compute_histogram.
	* gcc/gcov-io.c (gcov_write_summary): Write out non-zero histogram
        entries to function summary along with an occupancy bit vector.
	(gcov_read_summary): Read in the histogram entries.
	(gcov_histo_index): New function.
	(gcov_histogram_merge): Ditto.
	* gcc/gcov-io.h (gcov_type_unsigned): New type.
        (struct gcov_bucket_type): Ditto.
        (struct gcov_ctr_summary): Include histogram.
        (GCOV_TAG_SUMMARY_LENGTH): Update to include histogram entries.
        (GCOV_HISTOGRAM_SIZE): New macro.
        (GCOV_HISTOGRAM_BITVECTOR_SIZE): Ditto.
        (gcov_gcda_file_size): New parameter.
	* gcc/profile.c (NUM_GCOV_WORKING_SETS): Ditto.
        (gcov_working_sets): New global variable.
	(compute_working_sets): New function.
	(find_working_set): Ditto.
	(get_exec_counts): Invoke compute_working_sets.
	* gcc/loop-unroll.c (code_size_limit_factor): Call new function
        find_working_set to obtain working set information.
	* gcc/coverage.c (read_counts_file): Merge histograms, and
        fix bug with accessing summary info for non-summable counters.
	* gcc/basic-block.h (gcov_type_unsigned): New type.
        (struct gcov_working_set_info): Ditto.
        (find_working_set): Declare.
	* gcc/gcov-dump.c (tag_summary): Dump out histogram.
	* gcc/configure.ac (HOST_HAS_F_SETLKW): Set based on compile
        test using F_SETLKW with fcntl.
	* gcc/configure, gcc/config.in: Regenerate.

Modified:
    branches/google/gcc-4_7/gcc/ChangeLog.google-4_7
    branches/google/gcc-4_7/gcc/basic-block.h
    branches/google/gcc-4_7/gcc/config.in
    branches/google/gcc-4_7/gcc/configure
    branches/google/gcc-4_7/gcc/configure.ac
    branches/google/gcc-4_7/gcc/coverage.c
    branches/google/gcc-4_7/gcc/gcov-dump.c
    branches/google/gcc-4_7/gcc/gcov-io.c
    branches/google/gcc-4_7/gcc/gcov-io.h
    branches/google/gcc-4_7/gcc/loop-unroll.c
    branches/google/gcc-4_7/gcc/profile.c
    branches/google/gcc-4_7/libgcc/ChangeLog.google-4_7
    branches/google/gcc-4_7/libgcc/libgcov.c