This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)


On Tue, Apr 24, 2012 at 4:38 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Apr 24, 2012 at 11:26 PM, Teresa Johnson <tejohnson@google.com> wrote:
>
>> ? ? ? ?* params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param.
>> ? ? ? ?(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto.
>
> You should add documentation for these new PARAMs to doc/invoke.texi.

Ok, will do.

>
> I don't really like these new PARAMs: All other loop PARAMs are based
> on the number of insns in a loop, or the maximum number of times a
> transformation is applied. Your new
> PARAM_MIN_ITER_UNROLL_WITH_BRANCHES is completely different, because
> it is a number of iterations. This makes the PARAM value feel even
> more arbitrary than all the other PARAMs to some extend already do...

That's true, they are different in what they are checking than some of
the other loop unrolling params. But I need some threshold for
determining when a loop is hot enough that its unrolled branches will
be executed frequently enough to train the branch predictor and also
where the impact on the branch prediction in the outer region of code
is less likely to matter overall. The defaults were chosen so that the
new unrolling limit should only kick in for loops that are not
iterating much anyway, and where the outer hot region has quite a few
branches.

>
> (The only other PARAM like that is PARAM_ALIGN_LOOP_ITERATIONS, and
> its default value also looks quite arbitrary...)
>
>
>> Index: loop-unroll.c
>> ===================================================================
>> --- loop-unroll.c ? ? ? (revision 186783)
>> +++ loop-unroll.c ? ? ? (working copy)
>> @@ -152,6 +152,180 @@ static void combine_var_copies_in_loop_exit (struc
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? basic_block);
>> ?static rtx get_expansion (struct var_to_expand *);
>>
>> +/* Determine whether LOOP contains call. ?*/
>> +static bool
>> +loop_has_call(struct loop *loop)
>> +{
>> + ?basic_block *body, bb;
>> + ?unsigned i;
>> + ?rtx insn;
>> +
>> + ?body = get_loop_body (loop);
>> + ?for (i = 0; i < loop->num_nodes; i++)
>> + ? ?{
>> + ? ? ?bb = body[i];
>> +
>> + ? ? ?FOR_BB_INSNS (bb, insn)
>> + ? ? ? ?{
>> + ? ? ? ? ?if (CALL_P (insn))
>> + ? ? ? ? ? ?{
>> + ? ? ? ? ? ? ?free (body);
>> + ? ? ? ? ? ? ?return true;
>> + ? ? ? ? ? ?}
>> + ? ? ? ?}
>> + ? ?}
>> + ?free (body);
>> + ?return false;
>> +}
>> +
>> +/* Determine whether LOOP contains floating-point computation. ?*/
>> +static bool
>> +loop_has_FP_comp(struct loop *loop)
>> +{
>> + ?rtx set, dest;
>> + ?basic_block *body, bb;
>> + ?unsigned i;
>> + ?rtx insn;
>> +
>> + ?body = get_loop_body (loop);
>> + ?for (i = 0; i < loop->num_nodes; i++)
>> + ? ?{
>> + ? ? ?bb = body[i];
>> +
>> + ? ? ?FOR_BB_INSNS (bb, insn)
>> + ? ? ? ?{
>> + ? ? ? ? ?set = single_set (insn);
>> + ? ? ? ? ?if (!set)
>> + ? ? ? ? ? ?continue;
>> +
>> + ? ? ? ? ?dest = SET_DEST (set);
>> + ? ? ? ? ?if (FLOAT_MODE_P (GET_MODE (dest)))
>> + ? ? ? ? ? ?{
>> + ? ? ? ? ? ? ?free (body);
>> + ? ? ? ? ? ? ?return true;
>
> So you only detect single-set FP operations where some insns stores in
> a float mode. It wouldn't be very difficult to just walk over all sets
> and look for float modes. This is also necessary e.g. for x87 sincos,
> as well as various insns on other machines. Your comments say you
> don't want to apply the new heuristic to loops containing FP
> operations because these loops usually benefit more from unrolling.
> Therefore, you should IMHO look at non-single_set() insns also here,
> to avoid applying the heuristics to loops containing non-single_set()
> FP insns.

Ok, thanks for the suggestion, I will expand this for the next version
of the patch.

>
>
>> + ? ? ? ? ? ?}
>> + ? ? ? ?}
>> + ? ?}
>> + ?free (body);
>> + ?return false;
>> +}
>
> Nit: You are calling loop_has_call and loop_has_FP_comp() twice on
> each loop (first for constant iterations and next for runtime
> iterations),

I don't think that is true for loop_has_FP_comp, since it is called in
decide_unroll_constant_iterations and decide_unroll_runtime_iterations
just after we have checked if the loop has a constant number of
iterations, and returned early depending on the result of this check
and which routine we are in. So each inner loop will only reach the
call to loop_has_FP_comp in one of these routines.

In the case of loop_has_call, which is only called for a hot outer
loop, it is true we could invoke that more than once. That would
happen if a hot outer loop contains more than one nested inner loop
with a small iteration count and branches that we attempt to unroll
(it is called at most once per inner loop that we attempt to unroll).
I thought about attempting to cache this info for the outer loop in
the structure returned by get_simple_loop_desc() as you also suggest
below. I was concerned that currently this returns an niter_desc
structure which holds info about the # iterations, and this
information doesn't fit into that category. However, I could go ahead
and add it to that structure and perhaps rename the structure to
something more generic like "loop_desc". What do you think?

The other issue is that we don't need this new information on all
loops where we currently may compute and return an niter_desc
instance, and I didn't want to unnecessarily add a walk over the loop
bodies to compute the new information if we didn't need it. I'm not
sure of the tradeoff between always computing this information when we
do a get_simple_loop_desc vs calling it a couple times each for the
loops where we do need it (only when we have innermost loops with
shorter iteration counts and internal branches that we have decided to
unroll). One way around this would be to set the initial value of this
information in the loop_desc to a default value that means "unknown"
and then compute and cache the information lazily as needed.

> maybe you can fuse the functions

I don't know that it makes sense to fuse them, as they are called on
different loops - the call to loop_has_FP_comp will only be done for
innermost loops that we are attempting to unroll, whereas
loop_has_call is called only for outer loops that contain nested inner
loops. But if I am going to cache the results in the loop desc then it
may not be too expensive to do all the checks in one walk over each
loop, regardless of whether it is an outer or inner loop.

> and cache the results
> (e.g. with two bitmaps, or put it in the loop description and retrieve
> it with get_simple_loop_desc). Actually num_loop_branches()
> could/should also be cached. I realize that the loop body walks are
> probably not very expensive (and compile time probably isn't a concern
> if you're using profile driven optimizations) but they do all add
> up...

True, num_loop_branches could be called a couple times for each
innermost loop we attempt to unroll, because it is also called by
decide_peel_simple and decide_unroll_stupid. If I cache the other info
in a loop_desc as described above, we could cache this as well as the
result of compute_weighted_branches.

>
>
>> +/* Compute the number of branches in LOOP, weighted by execution counts. ?*/
>> +static float
>> +compute_weighted_branches(struct loop *loop)
>
> The floating point thing was already mentioned by Andrew. You can use
> integer math instead (for examples, look for BB_FREQ_MAX e.g. in
> average_num_loop_insns()).
>
>
>> + ?while (loop_outer(outer))
>> + ? ?{
>> + ? ? ?outer_desc = get_simple_loop_desc (outer);
>> +
>> + ? ? ?if (outer_desc->const_iter)
>> + ? ? ? ?outer_niters *= outer_desc->niter;
>> + ? ? ?else if (outer->header->count)
>> + ? ? ? ?outer_niters *= expected_loop_iterations (outer);
>> +
>> + ? ? ?weighted_outer_branches = compute_weighted_branches (outer);
>
> Can you delay this computation of "weighted_outer_branches" call to ...

Yes - thanks for the suggestion.

>
>> + ? ? ?/* Should have been checked by caller. ?*/
>> + ? ? ?gcc_assert(PARAM_VALUE (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES) != -1);
>
> Should never even happen. You have set the minimum acceptable value to
> 0. If you managed to test this code with
> PARAM_MIN_ITER_UNROLL_WITH_BRANCHES==-1, I'd like to know how (if you
> can do it from the command line, there is a bug in the handling of
> acceptable PARAM values :-)

You are right. This should have been removed - at one point I was
thinking of having a value of -1 indicate that the heuristic should
not kick in, but you are right that the minimum value set in
params.def does not support this, and in fact it is not needed at all
as setting this param to 0 will do the same thing. Will fix!

>
>
>> + ? ? ?/* If the outer loop has enough iterations to be considered hot, then
>> + ? ? ? ? we can stop our upwards loop tree traversal and examine the current
>> + ? ? ? ? outer loop. ?*/
>> + ? ? ?if (outer_niters >= PARAM_VALUE (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES))
>> + ? ? ? ?{
>> + ? ? ? ? ?/* Assume that any call will cause the branch budget to be exceeded,
>> + ? ? ? ? ? ? and that we can't unroll the current loop without increasing
>> + ? ? ? ? ? ? mispredicts. ?*/
>> + ? ? ? ? ?if (loop_has_call(outer))
>> + ? ? ? ? ? ?return 0;
>> +
>> + ? ? ? ? ?/* Otherwise, compute the maximum number of times current loop can be
>> + ? ? ? ? ? ? unrolled without exceeding our branch budget. First we subtract
>> + ? ? ? ? ? ? off the outer loop's weighted branch count from the budget. Note
>> + ? ? ? ? ? ? that this includes the branches in the current loop. This yields
>> + ? ? ? ? ? ? the number of branches left in the budget for the unrolled copies.
>> + ? ? ? ? ? ? We divide this by the number of branches in the current loop that
>> + ? ? ? ? ? ? must be duplicated when we unroll, which is the total weighted
>> + ? ? ? ? ? ? number of branches minus the back-edge branch. This yields the
>> + ? ? ? ? ? ? number of new loop body copies that can be created by unrolling
>> + ? ? ? ? ? ? without exceeding the budget, to which we add 1 to get the unroll
>> + ? ? ? ? ? ? factor. ?*/
>
> ... somewhere here, where weighted_outer_branches is used?

Yep, thanks.

>
>> + ? ? ? ? ?return (PARAM_VALUE (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET) -
>> + ? ? ? ? ? ? ? ? ?weighted_outer_branches)/(weighted_num_branches - 1) + 1;
>
> You should guard against weighted branches==1.0.

Right - the guard already exists at the top of the routine. I can add
a comment here to that effect.

Thanks!
Teresa

>
> Ciao!
> Steven



-- 
Teresa Johnson?|?Software Engineer?|?tejohnson@google.com?|?408-460-2413


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]