[PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Wed Aug 21 13:42:00 GMT 2019

On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi!
>
> Comparing to the previous versions of implementation mainly based on the
> existing IV cands but zeroing the related group/use cost, this new one is based
> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>
> Some key points are listed below:
>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>   2) Special name "doloop" assigned.
>   3) Doloop IV cand with form (niter+1, +, -1)
>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>      calculation on the IV base with may_be_zero (like COND_EXPR).
>   7) Set zero cost when using doloop IV cand for doloop use.
>   8) Add three hooks (should we merge _generic and _address?).
>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>        count register, we shouldn't consider the impact of doloop IV when
>        calculating register pressures.
>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>        generic type IV use.
>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>        address type IV use.
What will happen if doloop IV cand be used for generic/address type iv
use?  Can RTL doloop can still perform doloop optimization in this
case?

>
> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
> by PR89983.
>
> Any comments and suggestions are highly appreciated.  Thanks!
Not sure if I understand the patch correctly, some comments embedded.

+  /* The number of doloop candidate in the set.  */
+  unsigned n_doloop_cands;
+
This is unnecessary.  See below comments.

-    add_candidate_1 (data, base, step, important,
-                    IP_NORMAL, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+                    orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
+                    orig_iv);
Do we need to skip ip_end_pos case for doloop candidate?  Because the
candidate increment will be inserted in latch, i.e, increment position
is after exit condition.

-  tree_to_aff_combination (iv->base, type, val);
+  tree base = iv->base;
+  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
+     the value under !may_be_zero to get the compact bound which also well fits
+     for may_be_zero since we ensure the value for it is const one.  */
+  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
(desc->may_be_zero))
+    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
+                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
+                       build_int_cst (TREE_TYPE (niter), 1));
+  tree_to_aff_combination (base, type, val);
I don't quite follow here.  The iv->base is computed from niter, I
suppose compact bound is for cheaper candidate initialization?  Why
it's possible to extract !may_be_zero niter for may_be_zero here?  The
niter under !may_be_zero has no indication about the real niter under
may_be_zero.

-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
If I understand correctly, doloop use/cand will only be
identified/added for single exit loop, and there will be only one
cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
cand_value at niter at use position would be 0.  If that's the case,
we can skip calling cand_value_at here for doloop cand.  The change to
cand_value_at would be unnecessary neither.

-          expensive.  */
-  if (!integer_zerop (desc->may_be_zero))
+          expensive.
+
+     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
+     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
+  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
     return iv_elimination_compare_lt (data, cand, comp, desc);
And we can early return before this?

+  if (may_be_zero)
+    {
+      if (COMPARISON_CLASS_P (may_be_zero))
+       {
+         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+                              build_int_cst (ntype, 0),
+                              rewrite_to_non_trapping_overflow (niter));
+       }
+      /* Don't try to obtain the iteration count expression when may_be_zero is
+        integer_nonzerop (actually iteration count is one) or else.  */
+      else
+       return;
+    }
+
+  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
+                          build_int_cst (ntype, 1));
niter is the number of latch executions, so niter + 1 could wrap here,
but guess it's not a problem the similar issue is not handled in
vectorizer neither.

+  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
+  /* If target supports count register for doloop, it doesn't take GPR.  */
+  if (targetm.have_count_reg_decr_p)
+    n_spr_for_doloop = n_doloop_cands;
+  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
Not necessary.  See below.

-  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
+  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
+                                       ivs->n_doloop_cands);
Also.

       ivs->n_cands--;
+      if (cp->cand->doloop_p)
+       ivs->n_doloop_cands--;

          ivs->n_cands++;
+         if (cp->cand->doloop_p)
+           ivs->n_doloop_cands++;
You can just book n_cands under condition !cp->cand->doloop_p.

+  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
+    {
+      if (find_doloop_use (data))
+       {
+         data->doloop_use_p = true;
+         if (dump_file && (dump_flags & TDF_DETAILS))
+           {
+             fprintf (dump_file,
+                      "Predict loop %d can perform"
+                      " doloop optimization later.\n",
+                      loop->num);
+             flow_loop_dump (loop, dump_file, NULL, 1);
+           }
+       }
+    }
+
Please factor this into a function to keep caller short.

Thanks,
bin

>
> Kewen
>
> ---------
>
> gcc/ChangeLog
>
> 2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * target.def (have_count_reg_decr_p): New hook.
>         (doloop_cost_for_generic): Likewise.
>         (doloop_cost_for_address): Likewise.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * doc/tm.texi: Regenerate.
>         * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
>         addend.
>         (record_group): Init doloop_p.
>         (add_candidate_1): Add optional argument doloop, change the handlings
>         accordingly.
>         (add_candidate): Likewise.
>         (add_iv_candidate_for_biv): Update the call to add_candidate.
>         (generic_predict_doloop_p): Update attribute.
>         (force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
>         LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
>         UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
>         MIN_EXPR.
>         (determine_group_iv_cost_generic): Update for doloop IV cand.
>         (determine_group_iv_cost_address): Likewise.
>         (determine_group_iv_cost_cond): Likewise.
>         (determine_iv_cost): Likewise.
>         (ivopts_estimate_reg_pressure): Likewise.
>         (cand_value_at): Update argument niter type to struct tree_niter_desc*,
>         consider doloop IV cand and may_be_zero.
>         (may_eliminate_iv): Update the call to cand_value_at, consider doloop
>         IV cand and may_be_zero.
>         (add_iv_candidate_for_doloop): New function.
>         (find_iv_candidates): Call function add_iv_candidate_for_doloop.
>         (determine_set_costs): Update the call to ivopts_estimate_reg_pressure.
>         (iv_ca_recount_cost): Likewise.
>         (iv_ca_new): Init n_doloop_cands.
>         (iv_ca_set_no_cp): Update n_doloop_cands.
>         (iv_ca_set_cp): Likewise.
>         (iv_ca_dump): Dump register cost.
>         (find_doloop_use): Likewise.
>         (tree_ssa_iv_optimize_loop): Call function generic_predict_doloop_p and
>         find_doloop_use.
>
> gcc/testsuite/ChangeLog
>
> 2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
>         * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
>         * gcc.dg/tree-ssa/pr32044.c: Likewise.
>