Re: [Patch] Handle profile counts truncated to 0 in coldness test
- From: Teresa Johnson <tejohnson@google.com>
- To: Jan Hubicka <hubicka@ucw.cz>
- Cc: gcc-patches@gcc.gnu.org, David Li <davidxl@google.com>
- Date: Tue, 15 Oct 2013 15:58:45 -0700
- Subject: Re: [Patch] Handle profile counts truncated to 0 in coldness test
Ping on this one, since the new param would be useful in the other
profile-related patch I'm working on (the one to handle the dropped
profile counts in tree-profile).
Thanks,
Teresa
On Thu, Oct 10, 2013 at 2:35 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Mon, Oct 7, 2013 at 10:45 PM, Teresa Johnson <tejohnson@google.com> wrote:
>> On Sun, Oct 6, 2013 at 6:43 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> 2013-10-03 Teresa Johnson <tejohnson@google.com>
>>>>
>>>> * params.def (PARAM_MIN_HOT_RUN_RATIO): New parameter.
>>>
>>> I would not mention HOT in the parameter name. Blocks are hot/normal/unlikely
>>> and we already have HOT_BB_FREQUENCY_FRACTION.
>>>
>>> So perhaps UNLIKELY_BB_COUNT_FRACTION with an explanation that it is relative
>>> to number of train runs?
>>
>> Ok.
>>
>>>> +DEFPARAM(PARAM_MIN_HOT_RUN_RATIO,
>>>> + "min-hot-run-ratio",
>>>> + "The minimum ratio of profile runs to basic block execution count "
>>>> + "for the block to be considered hot",
>>> Perhaps as:
>>>> + "The maximum ratio of profile runs to basic block execution count "
>>>> + "for the block to be considered unlikely",
>>>
>>>> + 4, 0, 10000)
>>>
>>> lower bound should be 1 or we divide by 0.
>>
>> Yep, will fix.
>>
>>>> Index: predict.c
>>>> ===================================================================
>>>> --- predict.c (revision 203152)
>>>> +++ predict.c (working copy)
>>>> @@ -237,17 +237,31 @@ probably_never_executed (struct function *fun,
>>>> gcc_checking_assert (fun);
>>>> if (profile_status_for_function (fun) == PROFILE_READ)
>>>> {
>>>> - if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>>> + int min_run_ratio = PARAM_VALUE (PARAM_MIN_HOT_RUN_RATIO);
>>>> + if (RDIV (count * min_run_ratio, profile_info->runs) > 0)
>>>
>>> This way, if count is 1, runs needs to be at most min_run_ratio * 2.
>>> So perhaps if (count * min_run_ratio >= profile_info->runs) instead of division.
>>
>> Ok.
>>
>>>> - if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
>>>> + if (ENTRY_BLOCK_PTR->count)
>>>> {
>>>> - return (RDIV (frequency * ENTRY_BLOCK_PTR->count,
>>>> - ENTRY_BLOCK_PTR->frequency)
>>>> - < REG_BR_PROB_BASE / 4);
>>>> + gcov_type scaled_count
>>>> + = frequency * ENTRY_BLOCK_PTR->count * min_run_ratio;
>>>> + gcov_type computed_count;
>>>> + /* Check for overflow, in which case ENTRY_BLOCK_PTR->count should
>>>> + be large enough to do the division first without losing much
>>>> + precision. */
>>>> + if (scaled_count/frequency/min_run_ratio != ENTRY_BLOCK_PTR->count)
>>>
>>> I do not like the undefined behaviour triggered before the check (sure,
>>> unsigned arithmetic would fix that, but it seems all strange). What about
>>> guarding the whole code:
>>>
>>> if (ENTRY_BLOCK_PTR->count && count < REG_BR_PROB_BASE)
>>>
>>> For large counts the roundoff problems in the first test should not be a
>>> problem.
>>
>> Sure, that sounds reasonable.
>
> Actually, we don't care about count at this point, just ENTRY_BLOCK_PTR->count.
>
>>
>>>> + {
>>>> + computed_count = RDIV (ENTRY_BLOCK_PTR->count,
>>>> + ENTRY_BLOCK_PTR->frequency);
>>>> + computed_count *= frequency * min_run_ratio;
>>>> + }
>>>> + else
>>>> + computed_count = RDIV (scaled_count, ENTRY_BLOCK_PTR->frequency);
>>>> + if (RDIV (computed_count * min_run_ratio, profile_info->runs) > 0)
>>>
>>> So at nonoverflow path you compute
>>>
>>> ((frequency * entry_bb_count * min_run_ratio) / entry_bb_frequency) * min_run_ratio / runs > 0.5
>>
>> Yes, there is an extra min_run_ratio factor in there that I need to remove!
>>>
>>> I think you want
>>>
>>> (frequency * entry_bb_count * min_run_ratio) / entry_bb_frequency >= runs
>>
>> Right, will change that too.
>
> Note that the above change essentially halves the effect of the
> min_run_ratio (since we are comparing >= runs instead of >= 0.5 * runs
> due to the rounding divide).
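>
> For concreteness (a worked example, not part of the patch): with the
> fraction at 4 and runs = 100, the old test RDIV (count * 4, runs) > 0
> holds once count * 4 + 50 >= 100, i.e. count >= 13, while the new test
> count * 4 >= runs requires count >= 25. Doubling the fraction restores
> roughly the old cutoff.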
>
>>
>>>
>>> If entry_bb_frequency is 0, you can just return true iff frequency is > min_run_ratio.
>>
>> We already return false conservatively if entry_bb_frequency is 0. Do
>> we really want to change this - I'm not sure it makes sense to compare
>> the frequency to the run ratio directly.
>>
>>>
>>> I think safe way to compute this is
>>>
>>> if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>>>   scaled_count = (frequency * entry_bb_count * min_run_ratio) / entry_bb_frequency;
>>> else
>>>   scaled_count = ((frequency * entry_bb_count) / entry_bb_frequency) * min_run_ratio;
>>>
>>> In the first path, we have at most 10000*10000*10000*10000 = 10^16, which
>>> still fits in a 64-bit value (the signed 64-bit maximum is about 9.2*10^18).
>>>
>>> In the second path the base of the fixed-point arithmetic is large enough
>>> that the division should not matter. We never get over entry_count * 10000,
>>> which is safe. The first part of the statement, however, just computes a
>>> value that should match count, so count > 10000 unless the profile got
>>> misupdated, and the low-count roundoff errors should not matter. So perhaps
>>> the whole code can execute only if
>>>
>>> if (ENTRY_BLOCK_PTR->count && count < REG_BR_PROB_BASE
>>>     && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>>>
>>> and we do not need any conditionals for scaled_count.
>
> As noted above, we don't use count here, so this can be simplified to:
>
> if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>
> (and ENTRY_BLOCK_PTR->count > 0, which is checked earlier). See the
> fixed patch below.
>
>>>
>>> Does such code work with ratio changed to 16? I think the double multiplication in your
>>> patch may have caused problems.
>>
>> Let me fix this up and will get back.
>
> As noted above, the change to avoid the RDIV by runs halves the
> effect of the min_run_ratio (now unlikely_count_fraction). So
> we really need to increase this parameter to 32 to match what the
> old version of the code was doing.
>
> Indeed, using 32 does fix the same set of issues. However, when
> setting the parameter to the old default of 4, there are 5 core dumps
> using the patch to make the unlikely code unexecutable with 100
> training runs (see below for some examples of why). This goes down to 2
> failures if I set the parameter to 10, and 0 if I set it to 20. Using
> 20 essentially means that the code has to be executed in 5% of the runs,
> which seems reasonable for something like splitting. What do you
> think?
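>
> (For the arithmetic: the test is count * fraction >= runs, so a fraction
> of 20 keeps a block only when count >= runs / 20, i.e. when it executed
> in at least 5% of the runs on average.)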
>
> A couple of examples of why cold split code was being executed with
> 100 train runs:
>
> 1) No context sensitivity for routine that is inlined:
>
> In /usr/local/google/home/tejohnson/gcc_trunk_5/gcc/testsuite/gcc.c-torture/execute/pr33870.c,
> we call merge_pagelist 3 times from sort_pagelist. In the profile-use
> compile, merge_pagelist is inlined (I believe in all 3 locations). One
> of the conditional blocks (line 23) has a profile count that is less
> than runs/4. It turns out that from the first merge_pagelist callsite
> we always execute this block, but the other two are completely cold.
> But the profile info from all three calls is of course merged, and it
> looks like it is executed infrequently overall. So the block is split
> but then we execute it from the inline instance corresponding to the
> first call.
>
> I'm not sure what we can do in general here. But it is another data
> point suggesting to me that the blocks should be very cold to be
> split.
>
> 2) RTL expansion:
>
> Another was in the following code from
> /usr/local/google/home/tejohnson/gcc_trunk_5/gcc/testsuite/gcc.c-torture/execute/pr43784.c:
>
> static struct s __attribute__((noinline)) rp(void)
> {
>   return *p;
> }
>
> where p is a global that is assigned as:
>
> static union u v;
> static struct s *p = &v.d.b;
>
> RTL expansion synthesizes a bunch of tests and branches (alignment
> handling?), and guesses at the probabilities of the resulting
> branches. Unfortunately, those guesses are less than accurate, and we
> end up executing a block that was given a small likelihood, putting it
> below the threshold with the default unlikely fraction of 4.
>
> Here is the patch with the current threshold kept as 4. Let me know if
> you think we can kick this up to 20, or if we should just increase the
> threshold for splitting.
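>
> (Since this is a --param, the threshold can also be tuned per build
> without rebuilding GCC. With the patch applied, something like
>
> gcc -O2 -fprofile-use --param unlikely-bb-count-fraction=20 foo.c
>
> should work, where foo.c stands in for whatever file is being built.)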
>
> Passes regression tests, ok for trunk?
>
> (BTW, getting close. I have a bunch of fixes to the loop unroller that
> need some cleanup, and a small fix to ssa tail merging to update the
> profiles, but I think that does it for the tests I was looking at.
> Will send those soon for review.)
>
> Thanks,
> Teresa
>
> 2013-10-10 Teresa Johnson <tejohnson@google.com>
>
> * predict.c (probably_never_executed): Compare frequency-based
> count to number of training runs.
> * params.def (UNLIKELY_BB_COUNT_FRACTION): New parameter.
>
> Index: predict.c
> ===================================================================
> --- predict.c (revision 203390)
> +++ predict.c (working copy)
> @@ -237,17 +237,33 @@ probably_never_executed (struct function *fun,
> gcc_checking_assert (fun);
> if (profile_status_for_function (fun) == PROFILE_READ)
> {
> - if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> + int unlikely_count_fraction = PARAM_VALUE (UNLIKELY_BB_COUNT_FRACTION);
> + if (count * unlikely_count_fraction >= profile_info->runs)
> return false;
> if (!frequency)
> return true;
> if (!ENTRY_BLOCK_PTR->frequency)
> return false;
> - if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> + if (ENTRY_BLOCK_PTR->count)
> {
> - return (RDIV (frequency * ENTRY_BLOCK_PTR->count,
> - ENTRY_BLOCK_PTR->frequency)
> - < REG_BR_PROB_BASE / 4);
> + gcov_type computed_count;
> + /* Check for possibility of overflow, in which case entry bb count
> + is large enough to do the division first without losing much
> + precision. */
> + if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
> + {
> + gcov_type scaled_count
> + = frequency * ENTRY_BLOCK_PTR->count * unlikely_count_fraction;
> + computed_count = RDIV (scaled_count, ENTRY_BLOCK_PTR->frequency);
> + }
> + else
> + {
> + computed_count = RDIV (ENTRY_BLOCK_PTR->count,
> + ENTRY_BLOCK_PTR->frequency);
> + computed_count *= frequency * unlikely_count_fraction;
> + }
> + if (computed_count >= profile_info->runs)
> + return false;
> }
> return true;
> }
> Index: params.def
> ===================================================================
> --- params.def (revision 203390)
> +++ params.def (working copy)
> @@ -373,6 +373,11 @@ DEFPARAM(HOT_BB_FREQUENCY_FRACTION,
> "Select fraction of the maximal frequency of executions of
> basic block in function given basic block needs to have to be
> considered hot",
> 1000, 0, 0)
>
> +DEFPARAM(UNLIKELY_BB_COUNT_FRACTION,
> + "unlikely-bb-count-fraction",
> + "The minimum fraction of profile runs a given basic block
> execution count must be not to be considered unlikely",
> + 4, 1, 10000)
> +
> DEFPARAM (PARAM_ALIGN_THRESHOLD,
> "align-threshold",
> "Select fraction of the maximal frequency of executions of
> basic block in function given basic block get alignment",
>
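> To make the new test easy to experiment with outside of GCC, here is a
> standalone sketch with stand-in types and made-up numbers (gcov_type,
> RDIV and the parameter plumbing are approximated here, so this is not
> the patch itself):
>
> #include <stdio.h>
> #include <stdint.h>
>
> typedef int64_t gcov_type;
> #define REG_BR_PROB_BASE 10000
> #define RDIV(X, Y) (((X) + (Y) / 2) / (Y))
>
> static int
> probably_never_executed (gcov_type count, int frequency,
>                          gcov_type entry_count, int entry_frequency,
>                          gcov_type runs, int fraction)
> {
>   if (count * fraction >= runs)
>     return 0;
>   if (!frequency)
>     return 1;
>   if (!entry_frequency)
>     return 0;
>   if (entry_count)
>     {
>       gcov_type computed_count;
>       /* Small entry count: multiply first; the product is at most
>          10^4 * 10^8 * 10^4 = 10^16, which fits in 64 bits.  */
>       if (entry_count < (gcov_type) REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>         computed_count = RDIV (frequency * entry_count * fraction,
>                                (gcov_type) entry_frequency);
>       else
>         {
>           /* Huge entry count: divide first; the roundoff is negligible.  */
>           computed_count = RDIV (entry_count, (gcov_type) entry_frequency);
>           computed_count *= frequency * fraction;
>         }
>       if (computed_count >= runs)
>         return 0;
>     }
>   return 1;
> }
>
> int
> main (void)
> {
>   /* Block frequency 100 vs. entry frequency 10000 in a function entered
>      1000 times over 100 runs: the scaled count is 10, and 10 * 4 = 40 < 100,
>      so the block is reported as probably never executed (prints 1).  */
>   printf ("%d\n", probably_never_executed (0, 100, 1000, 10000, 100, 4));
>   return 0;
> }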
>
>>
>> Thanks,
>> Teresa
>>
>>>
>>> I will check the second part of patch separately.
>>> Thanks!
>>> Honza
--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413