Bug 89307 - -fprofile-generate binary may be too slow in multithreaded environment due to cache-line conflicts on counters
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: gcov-profile
Version: 9.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-12 15:39 UTC by Jan Hubicka
Modified: 2023-07-07 08:20 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-02-13 00:00:00


Attachments
patch for tls counters (incomplete - no runtime bits) (4.96 KB, patch)
2019-02-13 14:57 UTC, Jan Hubicka

Description Jan Hubicka 2019-02-12 15:39:46 UTC
I am filing this to track the problem of cache-line conflicts on profile counters, which is also discussed for the LLVM variant in http://lists.llvm.org/pipermail/llvm-dev/2014-April/072172.html
Comment 1 Martin Liška 2019-02-13 06:34:27 UTC
Can you please attach WIP patch you have?
Comment 2 Jan Hubicka 2019-02-13 14:57:59 UTC
Created attachment 45703
patch for tls counters (incomplete - no runtime bits)

Also, I think Google's code to reduce cache-line conflicts is https://gcc.gnu.org/ml/gcc-patches/2012-05/msg00959.html
Comment 3 Martin Liška 2019-02-14 12:45:24 UTC
(In reply to Jan Hubicka from comment #2)
> Created attachment 45703 [details]
> patch for tls counters (incomplete - no runtime bits)

Isn't the patch only a refactoring that is eliminating tls_model from tree_decl_with_vis and moving that into cgraph_node?
Comment 4 Martin Liška 2019-02-14 13:19:52 UTC
I'm just looking at the google/gcc-4.9 branch:
https://android.googlesource.com/toolchain/gcc/+/master/gcc-4.9/

and they have a sampling approach:

/* Transform:

   ORIGINAL CODE

   Into:

   __gcov_sample_counter++;
   if (__gcov_sample_counter >= __gcov_sampling_period)
     {
       __gcov_sample_counter = 0;
       ORIGINAL CODE
     }
 */

which effectively updates edge counters just for a limited time. I would expect
a size increase:

Removing basic block 9
Removing basic block 10
main (int argc)
{
  unsigned int PROF_sample.2;
  unsigned int PROF_sample.1;
  long int PROF_edge_counter_6;
  long int PROF_edge_counter_7;
  long int PROF_edge_counter_8;
  long int PROF_edge_counter_9;

  <bb 2>:
  __gcov_indirect_call_profiler_v2 (1005944783, main);
  __gcov_indirect_call_callee = 0B;
  if (argc_2(D) != 0)
    goto <bb 3>;
  else
    goto <bb 6>;

  <bb 3>:
  a = 123;
  PROF_sample.2_13 = __gcov_sample_counter;
  PROF_sample.2_14 = PROF_sample.2_13 + 1;
  __gcov_sample_counter = PROF_sample.2_14;
  PROF_sample.2_15 = __gcov_sampling_period;
  if (PROF_sample.2_14 >= PROF_sample.2_15)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:
  goto <bb 8>;

  <bb 5>:
  __gcov_sample_counter = 0;
  PROF_edge_counter_6 = __gcov0.main[0];
  PROF_edge_counter_7 = PROF_edge_counter_6 + 1;
  __gcov0.main[0] = PROF_edge_counter_7;
  goto <bb 8>;

  <bb 6>:
  a = 0;
  PROF_sample.1_10 = __gcov_sample_counter;
  PROF_sample.1_11 = PROF_sample.1_10 + 1;
  __gcov_sample_counter = PROF_sample.1_11;
  PROF_sample.1_12 = __gcov_sampling_period;
  if (PROF_sample.1_11 >= PROF_sample.1_12)
    goto <bb 7>;
  else
    goto <bb 4>;

  <bb 7>:
  __gcov_sample_counter = 0;
  PROF_edge_counter_8 = __gcov0.main[1];
  PROF_edge_counter_9 = PROF_edge_counter_8 + 1;
  __gcov0.main[1] = PROF_edge_counter_9;

  <bb 8>:
  return 0;
}
Comment 5 Martin Liška 2019-02-14 13:22:26 UTC
> 
> which effectively updates edge counters just for a limited time. I would
> expect

Ah now, it's really doing sampling. I guess it can lead to quite some profile inconsistencies..
Comment 6 Jan Hubicka 2019-02-14 14:42:36 UTC
> Ah now, it's really doing sampling. I guess it can lead to quite some profile
> inconsistencies..
Yep, it is not the coolest solution. I would not worry too much about
precision loss unless you get some weird interference between the
sampling counter and actual program behaviour.  Adding conditionals
everywhere is not very good, though, and I am not sure how well the CPU
will predict such branches.

Honza
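
The sampling transform quoted in comment 4, and the interference concern raised here, can be sketched in plain C. This is only an illustration under stated assumptions: `sample_counter`, `sampling_period`, and `edge_counter` are hypothetical stand-ins for `__gcov_sample_counter`, `__gcov_sampling_period`, and `__gcov0.main[]`, and the helper functions are invented for the example rather than taken from the Google branch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runtime variables seen in the dump. */
static unsigned int sample_counter;   /* role of __gcov_sample_counter  */
static unsigned int sampling_period;  /* role of __gcov_sampling_period */
static int64_t edge_counter[2];       /* role of __gcov0.main[]         */

/* Reset all state and choose a sampling period (illustration only). */
static void reset_profile(unsigned int period)
{
    sample_counter = 0;
    sampling_period = period;
    edge_counter[0] = 0;
    edge_counter[1] = 0;
}

/* The transformed counter update: bump the sample counter on every
   execution, but touch the edge counter only once per period. */
static void sampled_edge_update(int edge)
{
    assert(edge == 0 || edge == 1);
    sample_counter++;
    if (sample_counter >= sampling_period) {
        sample_counter = 0;
        edge_counter[edge]++;   /* the ORIGINAL CODE of the transform */
    }
}

/* Execute edge 0 n times; return how many updates were recorded. */
static int64_t run_single_edge(unsigned int n, unsigned int period)
{
    reset_profile(period);
    for (unsigned int i = 0; i < n; i++)
        sampled_edge_update(0);
    return edge_counter[0];
}

/* Execute edges 0, 1, 0, 1, ... n times in total; return the count
   recorded for edge 0. */
static int64_t run_alternating(unsigned int n, unsigned int period)
{
    reset_profile(period);
    for (unsigned int i = 0; i < n; i++)
        sampled_edge_update((int)(i % 2));
    return edge_counter[0];
}
```

In this sketch, 16 executions of one edge with a period of 4 record 4 updates, so recorded counts would need to be scaled by the period. With a period of 2 and two strictly alternating edges, every sample lands on the second edge and the first is recorded as never executed, which is the kind of interference between the sampling counter and program behaviour mentioned above.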
Comment 7 Richard Biener 2019-02-18 09:04:21 UTC
Btw, use of TLS has the following costs:

 * size of counters overhead (one could use char sized TLS counters and
   update the global ones with locking on overflow)
 * tear-down/build-up cost at thread termination/creation

The advantage is of course that it's simple implementation-wise.
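
The char-sized TLS counter idea can be sketched roughly as follows. This is a minimal illustration, not the attached patch or any libgcov code: all names (`local_counter`, `global_counter`, `count_edge`, `flush_thread_counters`, and the small helpers) are hypothetical, and it assumes C11 `_Thread_local` plus a POSIX mutex as the "locking" mechanism.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdint.h>

#define NCOUNTERS 2

/* Shared totals, modified only while holding the lock (hypothetical). */
static int64_t global_counter[NCOUNTERS];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* One small counter per thread and per edge, so the hot path never
   writes to a cache line shared with another thread. */
static _Thread_local unsigned char local_counter[NCOUNTERS];

static void global_add(int idx, int64_t amount)
{
    pthread_mutex_lock(&global_lock);
    global_counter[idx] += amount;
    pthread_mutex_unlock(&global_lock);
}

static int64_t global_read(int idx)
{
    pthread_mutex_lock(&global_lock);
    int64_t value = global_counter[idx];
    pthread_mutex_unlock(&global_lock);
    return value;
}

/* Hot path: increment the thread-local byte; when it wraps around to
   zero, credit a full UCHAR_MAX + 1 executions to the global counter. */
static void count_edge(int idx)
{
    assert(idx >= 0 && idx < NCOUNTERS);
    if (++local_counter[idx] == 0)
        global_add(idx, (int64_t)UCHAR_MAX + 1);
}

/* Tear-down cost: a thread has to hand over its residual counts before
   it exits, otherwise up to UCHAR_MAX executions per counter are lost. */
static void flush_thread_counters(void)
{
    for (int i = 0; i < NCOUNTERS; i++) {
        global_add(i, local_counter[i]);
        local_counter[i] = 0;
    }
}

/* Helpers for trying the sketch from a single thread. */
static int64_t count_n(int idx, long n)
{
    for (long i = 0; i < n; i++)
        count_edge(idx);
    return global_read(idx);
}

static int64_t flush_and_read(int idx)
{
    flush_thread_counters();
    return global_read(idx);
}
```

With 8-bit chars, counting one edge 1000 times from a single thread takes the lock only three times and leaves 768 in the global counter; the remaining 232 arrive when the thread flushes. In a real runtime the flush would have to be hooked into thread termination somehow, which is the tear-down cost listed above.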
Comment 8 Jakub Jelinek 2020-05-07 11:56:17 UTC
GCC 10.1 has been released.
Comment 9 Richard Biener 2020-07-23 06:51:51 UTC
GCC 10.2 is released, adjusting target milestone.
Comment 10 Richard Biener 2021-04-08 12:02:28 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Comment 11 Jakub Jelinek 2022-06-28 10:36:42 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.