This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: [PATCH,vect] ppc cost model options


"Jagasia, Harsha" <harsha.jagasia@amd.com> wrote on 25/11/2007 23:31:21:

> >Sorry for not being able to respond earlier to this thread.

> The attached patch bootstrapped and passed test on amd64-linux.
> Ok for mainline?

ok, with the following minor style fixes:

> +     allow all the iterations to be executed in the prologue peeled
> +     scalar loop*/

missing ".  " before end of comment.

> +  /* If the number of iterations is unknown, or the
> +     peeling-for-misalignment amount is unknown, we eill have to
generate

eill --> will

> -  /* Allow targets add additional (outside-of-loop) costs. FORNOW, the
only
> -     information we provide for the target is whether testing against
the
> -     threshold involves a runtime test.  */
> -  if (targetm.vectorize.builtin_vectorization_cost)
> -    {
> -      bool runtime_test = false;
> -
> -      /* If the number of iterations is unknown, or the
> -    peeling-for-misalignment amount is unknown, we eill have to generate
> -    a runtime test to test the loop count against the threshold.  */
> -      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> -     || (byte_misalign < 0))
> -   runtime_test = true;
> -      vec_outside_cost +=)
> -   targetm.vectorize.builtin_vectorization_cost (runtime_test);.
> -      if (vect_print_dump_info (REPORT_DETAILS))
> -   fprintf (vect_dump, "cost model : Adding target out-of-loop cost =
%d",
> -           targetm.vectorize.builtin_vectorization_cost (runtime_test));

the above was the only usage of
targetm.vectorize.builtin_vectorization_cost; since we are now removing it,
we should also remove the respective parts in target.h, target.def etc.
This can be done in a followup patch (maybe after we're done tuning on the
different platforms, in case we'd find we need such a builtin after all).

> +  if (runtime_test)
> +    {
> +      /* Cost model check occurs at versioning.  */
> +      if (VEC_length (tree, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> +     || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> +   scalar_outside_cost += TARG_COND_NOT_TAKEN_BRANCH_COST;
> +      else
> +   {
> +	  /* Cost model occurs at prologue generation.  */
> +     if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)).
> +       scalar_outside_cost += 2 * TARG_COND_NOT_TAKEN_BRANCH_COST))
> +         + TARG_COND_NOT_TAKEN_BRANCH_COST;
> +	  /* Cost model check occurs at epilogue generation.  */
> +     else
> +       scalar_outside_cost += 2 * TARG_COND_NOT_TAKEN_BRANCH_COST;)
> +	}

maybe add as documentation the parts from this thread below that explain
the TAKEN/NOT_TAKEN stuff? (in practice, the back end may decide to reorder
BBs differently and reverse conditions/branch directions, so I don't know
how much we can rely on this TAKEN/NOT_TAKEN analysis...).

> +     PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
> +     SOC = scalar outside cost for run time cost cost model check.  */

cost cost --> cost

> +      /* Set profitability threshhold for vectorized loop.  */

threshhold --> threshold
Set --> Get
(the above appears several times)


thanks,
dorit

> Thanks,

> >>> >vectorized and such) because the vectorizer cost-model is pending
> some
> >>> >fixes from Harsha
> >>> (http://gcc.gnu.org/ml/gcc-patches/2007-09/msg00916.html
> >>> >), to be followed by target-specific tuning of costs.)
> >>>
> >>> Here is the patch implementing the early cost model check.
> >>>
> >>
> >>great!
> >>
> >>> The check is added as follows:
> >>> - If we do versioning, the threshold comparison is and'ed to the
> guard
> >>> that controls the versioning,
> >>> - Otherwise, if we do peeling for alignment, we can determine the
> >>> loop-count of the prolog loop according to the threshold test
> >>> - Otherwise, the threshold comparison is or'ed with the guard that
> >>> controls whether to bypass the vector code and go straight to the
> scalar
> >>> epilogue.
> >>>
> >>
> >>The only technical issue I have is with the fact that I think we're
> not
> >>taking into account that because of the presence of the runtime-test,
> >there
> >>are costs that should be added also to the scalar side of the equasion
> >>(branch costs), and these costs are different in each of the above 3
> >cases.
> >>(This is something we don't need to worry about when we make the
> decision
> >>at compile time).  For example, in case 2, even if we choose to
> execute
> >the
> >>scalar loop we pay the cost of 2 branches that did not exist in the
> >>original code; in case 3 we pay the cost of 1-3 branches (depending on
> >>whether we also peel for alignment and if so, if we also enter the
> prolog
> >>loop). (We also pay for these branches on the vector side of the
> equasion,
> >>but this is already taken into account). So we actually have 4
> slighlty
> >>different equasions:
> >>
> >>equasion for compile-time test:
> >>      scalar_costs >= vector_costs
> >>
> >>equasion for run-time test, case 1:
> >>      scalar_costs
> >>      + versioning_guard_cost >= vector_costs
> >>
> >>equasion for run-time test, case 2:
> >>      scalar_costs
> >>      + cost_of_guard_for_prolog_peel >= vector_costs
> >>
> >>equasion for run-time test, case 3:
> >>      scalar_costs
> >>      + cost_of_guard_for_prolog_peel
> >>      + cost_of_guard_for_epilog_peel >= vector_costs
> >
> >Just to clarify
> >
> >In case 1, the check is:
> > if (cost > th & ...)
> >   jmp to vector code
> >Hence the run-time scalar cost should be incremented by a NOT-TAKEN
> branch.
> >
> >In case 2, the check is:
> >  if (cost <= th)
> >    prologue=scalar_iters
> >  if (prologue == 0)
> >    jmp to vector code
> >  else
> >    execute prologue
> >    if (prologue == num_iters)
> > go to exit
> >
> >Hence the run-time scalar cost should be incremented by a taken branch,
> >which on architectures like x86-64 should be a conditional move, plus a
> >NOT-TAKEN branch, plus a TAKEN BRANCH.
> >
> >In case 3, the check is:
> >  if (prologue == 0)
> >    jmp to vector code
> >  else
> >    execute prologue
> >    if prologue=num_iters
> > go to exit
> >  vector code:
> >  if (cost <=th | (scalar_iters-prologue-epilogue == 0))
> >    jmp to epilogue
> >
> >Hence the run-time scalar cost should be incremented by a (taken branch
> or
> >2 not-taken branches for prologue) plus taken branch.
> >
> >>Do you think we can add these 3 different fixups somehow, depending on
> >>where the run-time test is actually placed? (this can be important for
> >>platforms like the SPU, where branches can be very costly, such that
> we
> >may
> >>choose to vectorize the less efficient vector version and not pay the
> >>branch cost involved with branching to skip the vector code).
> >
> >Indeed, I will add the costs.
> >
> >>
> >>The rest are just various style issues:
> >>
> >>>  * tree-vect-transform.c (vect_do_peeling_for_loop_bound): Computes
> >>>  the treshold if the cost model check has not been done already.
> >>
> >>treshold -> threshold
> >>
> >>> Index: tree-vectorizer.c
> >>> ===================================================================
> >>> --- tree-vectorizer.c (revision 129356)
> >>> +++ tree-vectorizer.c (working copy)
> >>...
> >>>  slpeel_tree_peel_loop_to_edge (struct loop *loop,
> >>>            edge e, tree first_niters,
> >>>            tree niters, bool update_first_loop_count,
> >>> -          unsigned int th)
> >>> +          unsigned int th, bool check)
> >>
> >>Please document the new function argument "check" (I see that a
> >>documentation for "th" is also missing; if you feel like it, please
> add
> >>that too while you're at it...)
> >>
> >>Actually, maybe we can come up with something more descriptive than
> >>"check"?  e.g. "check_profitability", "add_cost_model_check"...
> >
> >Ok.
> >
> >>
> >>> -  /* 2. Add the guard that controls whether the first loop is
> executed.
> >>> -        Resulting CFG would be:
> >>> +  /* 2.a Add the guard that controls whether the first loop is
> >executed.
> >>> +         Resulting CFG would be:
> >>>
> >>> -        bb_before_first_loop:
> >>> -        if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> >>> -                               GOTO first-loop
> >>> +         bb_before_first_loop:
> >>> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> >>> +                                GOTO first-loop
> >>>
> >>> -        first_loop:
> >>> -        do {
> >>> -        } while ...
> >>> +         first_loop:
> >>> +         do {
> >>> +         } while ...
> >>>
> >>> -        bb_before_second_loop:
> >>> +         bb_before_second_loop:
> >>>
> >>> -        second_loop:
> >>> -        do {
> >>> -        } while ...
> >>> +         second_loop:
> >>> +         do {
> >>> +         } while ...
> >>>
> >>> -        orig_exit_bb:
> >>> -   */
> >>> +         orig_exit_bb:
> >>> +  */
> >>> +
> >>> +
> >>> +  /* 2.b Add the cost model check that allows the prologue
> >>> +         to iterate for the entire unchanged scalar
> >>> +         iterations of the loop in the event that the cost
> >>> +         model indicates that the scalar loop is more
> >>> +         profitable than the vector one.
> >>> +
> >>> +         Resulting CFG after prologue peeling would be:
> >>> +
> >>> +         if (scalar_loop_iterations <= th)
> >>> +           FIRST_NITERS = scalar_loop_iterations
> >>> +
> >>> +         bb_before_first_loop:
> >>> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> >>> +                                GOTO first-loop
> >>> +
> >>> +         first_loop:
> >>> +         do {
> >>> +         } while ...
> >>> +
> >>> +         bb_before_second_loop:
> >>> +
> >>> +         second_loop:
> >>> +         do {
> >>> +         } while ...
> >>> +
> >>> +         orig_exit_bb:
> >>> +  */
> >>> +
> >>> +  /* 2.c Add the cost model check that allows the epilogue
> >>> +         to iterate for the entire unchanged scalar
> >>> +         iterations of the loop in the event that the cost
> >>> +         model indicates that the scalar loop is more
> >>> +         profitable than the vector one.
> >>> +
> >>> +         Resulting CFG after prologue peeling would be:
> >>> +
> >>> +         bb_before_first_loop:
> >>> +         if ((scalar_loop_iterations <= th)
> >>> +             ||
> >>> +             FIRST_NITERS == 0) GOTO bb_before_second_loop
> >>> +                                GOTO first-loop
> >>> +
> >>> +         first_loop:
> >>> +         do {
> >>> +         } while ...
> >>> +
> >>> +         bb_before_second_loop:
> >>> +
> >>> +         second_loop:
> >>> +         do {
> >>> +         } while ...
> >>> +
> >>> +         orig_exit_bb:
> >>> +  */
> >>>
> >>
> >>Just so it's clear that this function does only one of the above per
> >>invocation, maybe write: "2. Add the control/guard code in one of the
> >>following ways: 2.a. ..."
> >
> >Ok.
> >
> >>Also, I was wondering if it would be clearer if the description was
> more
> >>compact - i.e. instead of duplicating 3 times the pseudo-code that
> remains
> >>the same in all 3 alternatives, do something along these lines:
> >>"
> >>2. Add the control/guard code. Resulting CFG would be:
> >>
> >>         bb_before_first_loop:
> >>         if (COND) GOTO bb_before_second_loop
> >>                   GOTO first-loop
> >>
> >>         first_loop:
> >>         do {
> >>         } while ...
> >>
> >>         bb_before_second_loop:
> >>
> >>         second_loop:
> >>         do {
> >>         } while ...
> >>
> >>         orig_exit_bb:
> >>
> >>The following 3 variants are supported:
> >>2a. The guard controls whether the first loop is executed.
> >>       COND: FIRST_NITERS == 0
> >>2b. Add the cost model check that allows the prologue to iterate....
> >>      if (scalar_loop_iterations <= th)
> >>        FIRST_NITERS = scalar_loop_iterations
> >>      COND: FIRST_NITERS == 0
> >>2c. Add the cost model check that allows the epilogue to iterate...
> >>      COND: (scalar_loop_iterations <= th) || FIRST_NITERS == 0
> >>"
> >>
> >>... but in the end I'm not sure which way is clearer, so I'll leave
> the
> >>decision to you.
> >>
> >
> >Ok, I will try to make it clearer. But its hard to say which is clearer
> >because the same function is used to do different things for prologue
> and
> >epilogue.
> >
> >>A general comment about the changes to slpeel_tree_peel_loop_to_edge:
> I
> >>wonder if this function would be taken out of the vectorizer one day
> and
> >>merged with the generic peeling utils. In any case, it would be
> desirable
> >>if instead of manipulating the condition inside the function, the
> function
> >>would get the condition as an input? (even if it ends up remaining a
> >>vectorizer specific function - I think this will be a good thing to
> do).
> >If
> >>this looks like too much work, then please consider factoring out the
> new
> >>code that builds the condition in 3 different ways, or at least just
> the
> >>part that creates the if-stmt for the "Peologue peeling case:
> >
> >Ok.
> >
> >>
> >>> +  /* Epilogue peeling.  */
> >>> +  if (!update_first_loop_count)
> >>> +    {
> >>...
> >>> +    }
> >>> +  /* Prologue peeling.  */
> >>> +  else
> >>> +    {
> >>> +      if (check)
> >>> +         first_niters = conditionally_reset_first_niters (...)
> >>> +
> >>...
> >>> +    }
> >>
> >>
> >>> Index: tree-vect-transform.c
> >>> ===================================================================
> >>> --- tree-vect-transform.c (revision 129356)
> >>> +++ tree-vect-transform.c (working copy)
> >>...
> >>> @@ -6482,28 +6483,35 @@ vect_do_peeling_for_loop_bound (loop_vec
> >>>
> >>>    loop_num  = loop->num;
> >>>
> >>> -  /* Analyze cost to set threshhold for vectorized loop.  */
> >>
> >>threshhold --> threshold
> >>
> >>...
> >
> >
> >Ok.
> >
> >>> +      /* Analyze cost to set threshhold for vectorized loop.  */
> >>> +      min_profitable_iters = LOOP_VINFO_COST_MODEL_MIN_ITERS
> >>(loop_vinfo);
> >>>
> >>
> >>about the comment - we don't "Analyze cost" here, just use the already
> >>precalculcated threshold. (it also repeats in other places).
> >>
> >>> -  th = (unsigned) min_scalar_loop_bound;
> >>> -  if (min_profitable_iters
> >>> -      && (!min_scalar_loop_bound
> >>> -          || min_profitable_iters > min_scalar_loop_bound))
> >>> -    th = (unsigned) min_profitable_iters;
> >>> +      min_scalar_loop_bound = ((PARAM_VALUE
> (PARAM_MIN_VECT_LOOP_BOUND)
> >>> +    * LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1);
> >>>
> >>> -  if (((LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
> >>> -      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> >>> -      && vect_print_dump_info (REPORT_DETAILS))
> >>> -    fprintf (vect_dump, "vectorization may not be profitable.");
> >>> +      /* Use the cost model only if it is more conservative than
> >>> user specified
> >>> +  threshold.  */
> >>> +      th = (unsigned) min_scalar_loop_bound;
> >>> +      if (min_profitable_iters
> >>> +   && (!min_scalar_loop_bound
> >>> +       || min_profitable_iters > min_scalar_loop_bound))
> >>> + th = (unsigned) min_profitable_iters;
> >>> +
> >>> +      if (th && vect_print_dump_info (REPORT_DETAILS))
> >>> + fprintf (vect_dump, "loop bound peeling: vectorization may not be
> >>> profitable.");
> >>> +    }
> >>
> >>The above code repeats almost identically 3 times: here, in
> >>vect_do_peeling_for_alignment and in vect_loop_versioning. Please it
> >factor
> >>out...
> >>
> >Ok.
> >
> >>> @@ -7031,7 +7078,11 @@ vect_create_cond_for_alias_checks (loop_
> >>>     data references that may or may not be aligned.  An additional
> >>>     sequence of runtime tests is generated for each pairs of DDRs
> whose
> >>>     independence was not proven.  The vectorized version of loop is
> >>> -   executed only if both alias and alignment tests are passed.  */
> >>> +   executed only if both alias and alignment tests are passed.
> >>> +
> >>> +   The test generated to check which version of loop is executed
> >>> +   is modified to check for profitability as indicated by the
> >>
> >>--> "is modified to *also* check for profitability", right?
> >>
> >
> >Yes.
> >
> >Thanks,
> >Harsha
> >
> >>thanks,
> >>dorit
> >>
> >>
> >>> The patch bootstrapped and passed test on amd64-linux.
> >>>
> >>> Ok for trunk?
> >>>
> >>> >
> >>> >> Thanks, David
> >>> >>
> >>> >>      * gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp: Use
> >>> >>    -mcpu=970 instead of 7400.
> >>> >>
> >>> >> Index: ppc-costmodel-vect.exp
> >>> >>
> ===================================================================
> >>> >> --- ppc-costmodel-vect.exp   (revision 129469)
> >>> >> +++ ppc-costmodel-vect.exp   (working copy)
> >>> >> @@ -49,7 +49,7 @@
> >>> >>  } else {
> >>> >>      if [is-effective-target ilp32] {
> >>> >>          # Specify a cpu that supports VMX for compile-only
> tests.
> >>> >> -        lappend DEFAULT_VECTCFLAGS "-mcpu=7400"
> >>> >> +        lappend DEFAULT_VECTCFLAGS "-mcpu=970"
> >>> >>      }
> >>> >>      set dg-do-what-default compile
> >>> >>  }
> >>> >
> >>> >
> >>>
> >>> [attachment "0008-costmodel-early-scalar.diff" deleted by Dorit
> >>> Nuzman/Haifa/IBM]
> >>
> >>

>
> #### 0009-costmodel-early-scalar.diff has been deleted (was saved in
> repository MyAttachments Repository ->) from this note on 01
> December 2007 by Dorit Nuzman


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]