[PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Kewen.Lin linkw@linux.ibm.com
Thu Aug 22 07:09:00 GMT 2019


Hi Bin,

Thanks for your time!

on 2019/8/21 下午8:32, Bin.Cheng wrote:
> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi!
>>
>> Comparing to the previous versions of implementation mainly based on the
>> existing IV cands but zeroing the related group/use cost, this new one is based
>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>>
>> Some key points are listed below:
>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>>   2) Special name "doloop" assigned.
>>   3) Doloop IV cand with form (niter+1, +, -1)
>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>>      calculation on the IV base with may_be_zero (like COND_EXPR).
>>   7) Set zero cost when using doloop IV cand for doloop use.
>>   8) Add three hooks (should we merge _generic and _address?).
>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>>        count register, we shouldn't consider the impact of doloop IV when
>>        calculating register pressures.
>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>>        generic type IV use.
>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>>        address type IV use.
> What will happen if doloop IV cand be used for generic/address type iv
> use?  Can RTL doloop can still perform doloop optimization in this
> case?
> 

On Power, we put the iteration count into hardware count register, it takes very
high cost to move the count to GPR, so the cost is set as INF to make it impossible
to use it for generic/address type iv use.  But as some discussion before, on some
targets using GPR instead of hardware count register, they probably want to use this
doloop iv used for other uses if profitable.  These two hooks offer the possibility.
In that case, I think RTL doloop can still perform since it can still get the 
pattern and transform.  The generic/address uses can still use it.
>>
>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
>> by PR89983.
>>
>> Any comments and suggestions are highly appreciated.  Thanks!
> Not sure if I understand the patch correctly, some comments embedded.
> 
> +  /* The number of doloop candidate in the set.  */
> +  unsigned n_doloop_cands;
> +
> This is unnecessary.  See below comments.
> 
> -    add_candidate_1 (data, base, step, important,
> -                    IP_NORMAL, use, NULL, orig_iv);
> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> +                    orig_iv);
>    if (ip_end_pos (data->current_loop)
>        && allow_ip_end_pos_p (data->current_loop))
> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
> +                    orig_iv);
> Do we need to skip ip_end_pos case for doloop candidate?  Because the
> candidate increment will be inserted in latch, i.e, increment position
> is after exit condition.
> 

Yes, we should skip it.  Currently function find_doloop_use has the check on an
empty latch and gimple_cond to latch, partially excluding it.  But it's still good
to guard it directly here.

> -  tree_to_aff_combination (iv->base, type, val);
> +  tree base = iv->base;
> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
> +     the value under !may_be_zero to get the compact bound which also well fits
> +     for may_be_zero since we ensure the value for it is const one.  */
> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> (desc->may_be_zero))
> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
> +                       build_int_cst (TREE_TYPE (niter), 1));
> +  tree_to_aff_combination (base, type, val);
> I don't quite follow here.  The iv->base is computed from niter, I
> suppose compact bound is for cheaper candidate initialization?  Why
> it's possible to extract !may_be_zero niter for may_be_zero here?  The
> niter under !may_be_zero has no indication about the real niter under
> may_be_zero.
> 

As you note below, the cand_value for doloop would be zero, but for the case
may_be_zero set, the current calculation would take care of the whole niter
expression including the cond_expr introduced by may_be_zero check, it's 
unexpected.  The purpose is to use the value under condition !may_be_zero
for the calculation, and yes, to get expected zero finally.

> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> If I understand correctly, doloop use/cand will only be
> identified/added for single exit loop, and there will be only one
> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> cand_value at niter at use position would be 0.  If that's the case,
> we can skip calling cand_value_at here for doloop cand.  The change to
> cand_value_at would be unnecessary neither.
> 

Exactly, I'll add the early return with zero bound for doloop.

> -          expensive.  */
> -  if (!integer_zerop (desc->may_be_zero))
> +          expensive.
> +
> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
>      return iv_elimination_compare_lt (data, cand, comp, desc);
> And we can early return before this?
> 

OK.

> +  if (may_be_zero)
> +    {
> +      if (COMPARISON_CLASS_P (may_be_zero))
> +       {
> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> +                              build_int_cst (ntype, 0),
> +                              rewrite_to_non_trapping_overflow (niter));
> +       }
> +      /* Don't try to obtain the iteration count expression when may_be_zero is
> +        integer_nonzerop (actually iteration count is one) or else.  */
> +      else
> +       return;
> +    }
> +
> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> +                          build_int_cst (ntype, 1));
> niter is the number of latch executions, so niter + 1 could wrap here,
> but guess it's not a problem the similar issue is not handled in
> vectorizer neither.
> 

OK.

> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> +  /* If target supports count register for doloop, it doesn't take GPR.  */
> +  if (targetm.have_count_reg_decr_p)
> +    n_spr_for_doloop = n_doloop_cands;
> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> Not necessary.  See below.

> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> +                                       ivs->n_doloop_cands);
> Also.
> 
>        ivs->n_cands--;
> +      if (cp->cand->doloop_p)
> +       ivs->n_doloop_cands--;
> 
>           ivs->n_cands++;
> +         if (cp->cand->doloop_p)
> +           ivs->n_doloop_cands++;
> You can just book n_cands under condition !cp->cand->doloop_p.

If my understanding is correct, you are suggesting the code like:

if (!cp->cand->doloop_p)
  ivs->n_cands++;

But I'm afraid that it can NOT satisfy the need in function
ivopts_estimate_reg_pressure.  As the comments, "if target supports
count register for doloop it doesn't take GPR.".  If we make doloop
cand invisible in n_cands, it's fine for target with count register,
but we may miss to count them on targets without count register.

> 
> +  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
> +    {
> +      if (find_doloop_use (data))
> +       {
> +         data->doloop_use_p = true;
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             fprintf (dump_file,
> +                      "Predict loop %d can perform"
> +                      " doloop optimization later.\n",
> +                      loop->num);
> +             flow_loop_dump (loop, dump_file, NULL, 1);
> +           }
> +       }
> +    }
> +
> Please factor this into a function to keep caller short.
> 

OK.


Thanks!
Kewen



More information about the Gcc-patches mailing list