This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts


Hi Bin,

on 2019/8/22 下午1:46, Bin.Cheng wrote:
> On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> Thanks for your time!
>>
>> on 2019/8/21 下午8:32, Bin.Cheng wrote:
>>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Comparing to the previous versions of implementation mainly based on the
>>>> existing IV cands but zeroing the related group/use cost, this new one is based
>>>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>>>>
>>>> Some key points are listed below:
>>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>>>>   2) Special name "doloop" assigned.
>>>>   3) Doloop IV cand with form (niter+1, +, -1)
>>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>>>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>>>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
>>>>   7) Set zero cost when using doloop IV cand for doloop use.
>>>>   8) Add three hooks (should we merge _generic and _address?).
>>>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>>>>        count register, we shouldn't consider the impact of doloop IV when
>>>>        calculating register pressures.
>>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>>>>        generic type IV use.
>>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>>>>        address type IV use.
>>> What will happen if doloop IV cand be used for generic/address type iv
>>> use?  Can RTL doloop can still perform doloop optimization in this
>>> case?
>>>
>>
>> On Power, we put the iteration count into hardware count register, it takes very
>> high cost to move the count to GPR, so the cost is set as INF to make it impossible
>> to use it for generic/address type iv use.  But as some discussion before, on some
>> targets using GPR instead of hardware count register, they probably want to use this
>> doloop iv used for other uses if profitable.  These two hooks offer the possibility.
>> In that case, I think RTL doloop can still perform since it can still get the
>> pattern and transform.  The generic/address uses can still use it.
>>>>
>>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
>>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
>>>> by PR89983.
>>>>
>>>> Any comments and suggestions are highly appreciated.  Thanks!
>>> Not sure if I understand the patch correctly, some comments embedded.
>>>
>>> +  /* The number of doloop candidate in the set.  */
>>> +  unsigned n_doloop_cands;
>>> +
>>> This is unnecessary.  See below comments.
>>>
>>> -    add_candidate_1 (data, base, step, important,
>>> -                    IP_NORMAL, use, NULL, orig_iv);
>>> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
>>> +                    orig_iv);
>>>    if (ip_end_pos (data->current_loop)
>>>        && allow_ip_end_pos_p (data->current_loop))
>>> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
>>> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
>>> +                    orig_iv);
>>> Do we need to skip ip_end_pos case for doloop candidate?  Because the
>>> candidate increment will be inserted in latch, i.e, increment position
>>> is after exit condition.
>>>
>>
>> Yes, we should skip it.  Currently function find_doloop_use has the check on an
>> empty latch and gimple_cond to latch, partially excluding it.  But it's still good
>> to guard it directly here.
>>
>>> -  tree_to_aff_combination (iv->base, type, val);
>>> +  tree base = iv->base;
>>> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
>>> +     the value under !may_be_zero to get the compact bound which also well fits
>>> +     for may_be_zero since we ensure the value for it is const one.  */
>>> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
>>> (desc->may_be_zero))
>>> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
>>> +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
>>> +                       build_int_cst (TREE_TYPE (niter), 1));
>>> +  tree_to_aff_combination (base, type, val);
>>> I don't quite follow here.  The iv->base is computed from niter, I
>>> suppose compact bound is for cheaper candidate initialization?  Why
>>> it's possible to extract !may_be_zero niter for may_be_zero here?  The
>>> niter under !may_be_zero has no indication about the real niter under
>>> may_be_zero.
>>>
>>
>> As you note below, the cand_value for doloop would be zero, but for the case
>> may_be_zero set, the current calculation would take care of the whole niter
>> expression including the cond_expr introduced by may_be_zero check, it's
>> unexpected.  The purpose is to use the value under condition !may_be_zero
>> for the calculation, and yes, to get expected zero finally.
>>
>>> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
>>> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
>>> If I understand correctly, doloop use/cand will only be
>>> identified/added for single exit loop, and there will be only one
>>> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
>>> cand_value at niter at use position would be 0.  If that's the case,
>>> we can skip calling cand_value_at here for doloop cand.  The change to
>>> cand_value_at would be unnecessary neither.
>>>
>>
>> Exactly, I'll add the early return with zero bound for doloop.
>>
>>> -          expensive.  */
>>> -  if (!integer_zerop (desc->may_be_zero))
>>> +          expensive.
>>> +
>>> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
>>> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
>>> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
>>>      return iv_elimination_compare_lt (data, cand, comp, desc);
>>> And we can early return before this?
>>>
>>
>> OK.
>>
>>> +  if (may_be_zero)
>>> +    {
>>> +      if (COMPARISON_CLASS_P (may_be_zero))
>>> +       {
>>> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
>>> +                              build_int_cst (ntype, 0),
>>> +                              rewrite_to_non_trapping_overflow (niter));
>>> +       }
>>> +      /* Don't try to obtain the iteration count expression when may_be_zero is
>>> +        integer_nonzerop (actually iteration count is one) or else.  */
>>> +      else
>>> +       return;
>>> +    }
>>> +
>>> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> +                          build_int_cst (ntype, 1));
>>> niter is the number of latch executions, so niter + 1 could wrap here,
>>> but guess it's not a problem the similar issue is not handled in
>>> vectorizer neither.
>>>
>>
>> OK.
>>
>>> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
>>> +  /* If target supports count register for doloop, it doesn't take GPR.  */
>>> +  if (targetm.have_count_reg_decr_p)
>>> +    n_spr_for_doloop = n_doloop_cands;
>>> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
>>> Not necessary.  See below.
>>
>>> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
>>> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
>>> +                                       ivs->n_doloop_cands);
>>> Also.
>>>
>>>        ivs->n_cands--;
>>> +      if (cp->cand->doloop_p)
>>> +       ivs->n_doloop_cands--;
>>>
>>>           ivs->n_cands++;
>>> +         if (cp->cand->doloop_p)
>>> +           ivs->n_doloop_cands++;
>>> You can just book n_cands under condition !cp->cand->doloop_p.
>>
>> If my understanding is correct, you are suggesting the code like:
>>
>> if (!cp->cand->doloop_p)
>>   ivs->n_cands++;
>>
>> But I'm afraid that it can NOT satisfy the need in function
>> ivopts_estimate_reg_pressure.  As the comments, "if target supports
>> count register for doloop it doesn't take GPR.".  If we make doloop
>> cand invisible in n_cands, it's fine for target with count register,
>> but we may miss to count them on targets without count register.
> Why not one more step do checks:
> if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
>   ivs->n_cands++;
> 

Yes, it works.  Thanks!

The new patch addressing the comments is attached.  
Could you please have a look again?  Thanks in advance!


Kewen

---------

gcc/ChangeLog

2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* target.def (have_count_reg_decr_p): New hook.
	(doloop_cost_for_generic): Likewise.
	(doloop_cost_for_address): Likewise.
	* doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
	addend.
	(record_group): Init doloop_p.
	(add_candidate_1): Add optional argument doloop, change the handlings
	accordingly.
	(add_candidate): Likewise.
	(add_iv_candidate_for_biv): Update the call to add_candidate.
	(generic_predict_doloop_p): Update attribute.
	(force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
	LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
	UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
	MIN_EXPR.
	(determine_group_iv_cost_generic): Update for doloop IV cand.
	(determine_group_iv_cost_address): Likewise.
	(determine_group_iv_cost_cond): Likewise.
	(determine_iv_cost): Likewise.
	(ivopts_estimate_reg_pressure): Likewise.
	(may_eliminate_iv): Likewise.
	(add_iv_candidate_for_doloop): New function.
	(find_iv_candidates): Call function add_iv_candidate_for_doloop.
	(iv_ca_set_no_cp): Update for doloop IV cand.
	(iv_ca_set_cp): Likewise.
	(iv_ca_dump): Dump register cost.
	(find_doloop_use): New function.
	(predict_and_process_doloop): Likewise.
	(tree_ssa_iv_optimize_loop): Call function predict_and_process_doloop.

gcc/testsuite/ChangeLog

2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/pr32044.c: Likewise.

Attachment: doloop_dedicated_iv2.diff
Description: Text document


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]