[PATCH,vect] ppc cost model options

Mon Nov 26 02:24:00 GMT 2007

>Sorry for not being able to respond earlier to this thread.

The attached patch bootstrapped and passed test on amd64-linux.
Ok for mainline?

Thanks,

>>> >vectorized and such) because the vectorizer cost-model is pending
some
>>> >fixes from Harsha
>>> (http://gcc.gnu.org/ml/gcc-patches/2007-09/msg00916.html
>>> >), to be followed by target-specific tuning of costs.)
>>>
>>> Here is the patch implementing the early cost model check.
>>>
>>
>>great!
>>
>>> The check is added as follows:
>>> - If we do versioning, the threshold comparison is and'ed to the
guard
>>> that controls the versioning,
>>> - Otherwise, if we do peeling for alignment, we can determine the
>>> loop-count of the prolog loop according to the threshold test
>>> - Otherwise, the threshold comparison is or'ed with the guard that
>>> controls whether to bypass the vector code and go straight to the
scalar
>>> epilogue.
>>>
>>
>>The only technical issue I have is with the fact that I think we're
not
>>taking into account that because of the presence of the runtime-test,
>there
>>are costs that should be added also to the scalar side of the equasion
>>(branch costs), and these costs are different in each of the above 3
>cases.
>>(This is something we don't need to worry about when we make the
decision
>>at compile time).  For example, in case 2, even if we choose to
execute
>the
>>scalar loop we pay the cost of 2 branches that did not exist in the
>>original code; in case 3 we pay the cost of 1-3 branches (depending on
>>whether we also peel for alignment and if so, if we also enter the
prolog
>>loop). (We also pay for these branches on the vector side of the
equasion,
>>but this is already taken into account). So we actually have 4
slighlty
>>different equasions:
>>
>>equasion for compile-time test:
>>      scalar_costs >= vector_costs
>>
>>equasion for run-time test, case 1:
>>      scalar_costs
>>      + versioning_guard_cost >= vector_costs
>>
>>equasion for run-time test, case 2:
>>      scalar_costs
>>      + cost_of_guard_for_prolog_peel >= vector_costs
>>
>>equasion for run-time test, case 3:
>>      scalar_costs
>>      + cost_of_guard_for_prolog_peel
>>      + cost_of_guard_for_epilog_peel >= vector_costs
>
>Just to clarify
>
>In case 1, the check is:
>	if (cost > th & ...)
>	  jmp to vector code
>Hence the run-time scalar cost should be incremented by a NOT-TAKEN
branch.
>
>In case 2, the check is:
>  if (cost <= th)
>    prologue=scalar_iters
>  if (prologue == 0)
>    jmp to vector code
>  else
>    execute prologue
>    if (prologue == num_iters)
>	go to exit
>
>Hence the run-time scalar cost should be incremented by a taken branch,
>which on architectures like x86-64 should be a conditional move, plus a
>NOT-TAKEN branch, plus a TAKEN BRANCH.
>
>In case 3, the check is:
>  if (prologue == 0)
>    jmp to vector code
>  else
>    execute prologue
>    if prologue=num_iters
>	go to exit
>  vector code:
>  if (cost <=th | (scalar_iters-prologue-epilogue == 0))
>    jmp to epilogue
>
>Hence the run-time scalar cost should be incremented by a (taken branch
or
>2 not-taken branches for prologue) plus taken branch.
>
>>Do you think we can add these 3 different fixups somehow, depending on
>>where the run-time test is actually placed? (this can be important for
>>platforms like the SPU, where branches can be very costly, such that
we
>may
>>choose to vectorize the less efficient vector version and not pay the
>>branch cost involved with branching to skip the vector code).
>
>Indeed, I will add the costs.
>
>>
>>The rest are just various style issues:
>>
>>>  * tree-vect-transform.c (vect_do_peeling_for_loop_bound): Computes
>>>  the treshold if the cost model check has not been done already.
>>
>>treshold -> threshold
>>
>>> Index: tree-vectorizer.c
>>> ===================================================================
>>> --- tree-vectorizer.c (revision 129356)
>>> +++ tree-vectorizer.c (working copy)
>>...
>>>  slpeel_tree_peel_loop_to_edge (struct loop *loop,
>>>            edge e, tree first_niters,
>>>            tree niters, bool update_first_loop_count,
>>> -          unsigned int th)
>>> +          unsigned int th, bool check)
>>
>>Please document the new function argument "check" (I see that a
>>documentation for "th" is also missing; if you feel like it, please
add
>>that too while you're at it...)
>>
>>Actually, maybe we can come up with something more descriptive than
>>"check"?  e.g. "check_profitability", "add_cost_model_check"...
>
>Ok.
>
>>
>>> -  /* 2. Add the guard that controls whether the first loop is
executed.
>>> -        Resulting CFG would be:
>>> +  /* 2.a Add the guard that controls whether the first loop is
>executed.
>>> +         Resulting CFG would be:
>>>
>>> -        bb_before_first_loop:
>>> -        if (FIRST_NITERS == 0) GOTO bb_before_second_loop
>>> -                               GOTO first-loop
>>> +         bb_before_first_loop:
>>> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
>>> +                                GOTO first-loop
>>>
>>> -        first_loop:
>>> -        do {
>>> -        } while ...
>>> +         first_loop:
>>> +         do {
>>> +         } while ...
>>>
>>> -        bb_before_second_loop:
>>> +         bb_before_second_loop:
>>>
>>> -        second_loop:
>>> -        do {
>>> -        } while ...
>>> +         second_loop:
>>> +         do {
>>> +         } while ...
>>>
>>> -        orig_exit_bb:
>>> -   */
>>> +         orig_exit_bb:
>>> +  */
>>> +
>>> +
>>> +  /* 2.b Add the cost model check that allows the prologue
>>> +         to iterate for the entire unchanged scalar
>>> +         iterations of the loop in the event that the cost
>>> +         model indicates that the scalar loop is more
>>> +         profitable than the vector one.
>>> +
>>> +         Resulting CFG after prologue peeling would be:
>>> +
>>> +         if (scalar_loop_iterations <= th)
>>> +           FIRST_NITERS = scalar_loop_iterations
>>> +
>>> +         bb_before_first_loop:
>>> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
>>> +                                GOTO first-loop
>>> +
>>> +         first_loop:
>>> +         do {
>>> +         } while ...
>>> +
>>> +         bb_before_second_loop:
>>> +
>>> +         second_loop:
>>> +         do {
>>> +         } while ...
>>> +
>>> +         orig_exit_bb:
>>> +  */
>>> +
>>> +  /* 2.c Add the cost model check that allows the epilogue
>>> +         to iterate for the entire unchanged scalar
>>> +         iterations of the loop in the event that the cost
>>> +         model indicates that the scalar loop is more
>>> +         profitable than the vector one.
>>> +
>>> +         Resulting CFG after prologue peeling would be:
>>> +
>>> +         bb_before_first_loop:
>>> +         if ((scalar_loop_iterations <= th)
>>> +             ||
>>> +             FIRST_NITERS == 0) GOTO bb_before_second_loop
>>> +                                GOTO first-loop
>>> +
>>> +         first_loop:
>>> +         do {
>>> +         } while ...
>>> +
>>> +         bb_before_second_loop:
>>> +
>>> +         second_loop:
>>> +         do {
>>> +         } while ...
>>> +
>>> +         orig_exit_bb:
>>> +  */
>>>
>>
>>Just so it's clear that this function does only one of the above per
>>invocation, maybe write: "2. Add the control/guard code in one of the
>>following ways: 2.a. ..."
>
>Ok.
>
>>Also, I was wondering if it would be clearer if the description was
more
>>compact - i.e. instead of duplicating 3 times the pseudo-code that
remains
>>the same in all 3 alternatives, do something along these lines:
>>"
>>2. Add the control/guard code. Resulting CFG would be:
>>
>>         bb_before_first_loop:
>>         if (COND) GOTO bb_before_second_loop
>>                   GOTO first-loop
>>
>>         first_loop:
>>         do {
>>         } while ...
>>
>>         bb_before_second_loop:
>>
>>         second_loop:
>>         do {
>>         } while ...
>>
>>         orig_exit_bb:
>>
>>The following 3 variants are supported:
>>2a. The guard controls whether the first loop is executed.
>>       COND: FIRST_NITERS == 0
>>2b. Add the cost model check that allows the prologue to iterate....
>>      if (scalar_loop_iterations <= th)
>>        FIRST_NITERS = scalar_loop_iterations
>>      COND: FIRST_NITERS == 0
>>2c. Add the cost model check that allows the epilogue to iterate...
>>      COND: (scalar_loop_iterations <= th) || FIRST_NITERS == 0
>>"
>>
>>... but in the end I'm not sure which way is clearer, so I'll leave
the
>>decision to you.
>>
>
>Ok, I will try to make it clearer. But its hard to say which is clearer
>because the same function is used to do different things for prologue
and
>epilogue.
>
>>A general comment about the changes to slpeel_tree_peel_loop_to_edge:
I
>>wonder if this function would be taken out of the vectorizer one day
and
>>merged with the generic peeling utils. In any case, it would be
desirable
>>if instead of manipulating the condition inside the function, the
function
>>would get the condition as an input? (even if it ends up remaining a
>>vectorizer specific function - I think this will be a good thing to
do).
>If
>>this looks like too much work, then please consider factoring out the
new
>>code that builds the condition in 3 different ways, or at least just
the
>>part that creates the if-stmt for the "Peologue peeling case:
>
>Ok.
>
>>
>>> +  /* Epilogue peeling.  */
>>> +  if (!update_first_loop_count)
>>> +    {
>>...
>>> +    }
>>> +  /* Prologue peeling.  */
>>> +  else
>>> +    {
>>> +      if (check)
>>> +         first_niters = conditionally_reset_first_niters (...)
>>> +
>>...
>>> +    }
>>
>>
>>> Index: tree-vect-transform.c
>>> ===================================================================
>>> --- tree-vect-transform.c (revision 129356)
>>> +++ tree-vect-transform.c (working copy)
>>...
>>> @@ -6482,28 +6483,35 @@ vect_do_peeling_for_loop_bound (loop_vec
>>>
>>>    loop_num  = loop->num;
>>>
>>> -  /* Analyze cost to set threshhold for vectorized loop.  */
>>
>>threshhold --> threshold
>>
>>...
>
>
>Ok.
>
>>> +      /* Analyze cost to set threshhold for vectorized loop.  */
>>> +      min_profitable_iters = LOOP_VINFO_COST_MODEL_MIN_ITERS
>>(loop_vinfo);
>>>
>>
>>about the comment - we don't "Analyze cost" here, just use the already
>>precalculcated threshold. (it also repeats in other places).
>>
>>> -  th = (unsigned) min_scalar_loop_bound;
>>> -  if (min_profitable_iters
>>> -      && (!min_scalar_loop_bound
>>> -          || min_profitable_iters > min_scalar_loop_bound))
>>> -    th = (unsigned) min_profitable_iters;
>>> +      min_scalar_loop_bound = ((PARAM_VALUE
(PARAM_MIN_VECT_LOOP_BOUND)
>>> +    * LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1);
>>>
>>> -  if (((LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
>>> -      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>>> -      && vect_print_dump_info (REPORT_DETAILS))
>>> -    fprintf (vect_dump, "vectorization may not be profitable.");
>>> +      /* Use the cost model only if it is more conservative than
>>> user specified
>>> +  threshold.  */
>>> +      th = (unsigned) min_scalar_loop_bound;
>>> +      if (min_profitable_iters
>>> +   && (!min_scalar_loop_bound
>>> +       || min_profitable_iters > min_scalar_loop_bound))
>>> + th = (unsigned) min_profitable_iters;
>>> +
>>> +      if (th && vect_print_dump_info (REPORT_DETAILS))
>>> + fprintf (vect_dump, "loop bound peeling: vectorization may not be
>>> profitable.");
>>> +    }
>>
>>The above code repeats almost identically 3 times: here, in
>>vect_do_peeling_for_alignment and in vect_loop_versioning. Please it
>factor
>>out...
>>
>Ok.
>
>>> @@ -7031,7 +7078,11 @@ vect_create_cond_for_alias_checks (loop_
>>>     data references that may or may not be aligned.  An additional
>>>     sequence of runtime tests is generated for each pairs of DDRs
whose
>>>     independence was not proven.  The vectorized version of loop is
>>> -   executed only if both alias and alignment tests are passed.  */
>>> +   executed only if both alias and alignment tests are passed.
>>> +
>>> +   The test generated to check which version of loop is executed
>>> +   is modified to check for profitability as indicated by the
>>
>>--> "is modified to *also* check for profitability", right?
>>
>
>Yes.
>
>Thanks,
>Harsha
>
>>thanks,
>>dorit
>>
>>
>>> The patch bootstrapped and passed test on amd64-linux.
>>>
>>> Ok for trunk?
>>>
>>> >
>>> >> Thanks, David
>>> >>
>>> >>      * gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp: Use
>>> >>    -mcpu=970 instead of 7400.
>>> >>
>>> >> Index: ppc-costmodel-vect.exp
>>> >>
===================================================================
>>> >> --- ppc-costmodel-vect.exp   (revision 129469)
>>> >> +++ ppc-costmodel-vect.exp   (working copy)
>>> >> @@ -49,7 +49,7 @@
>>> >>  } else {
>>> >>      if [is-effective-target ilp32] {
>>> >>          # Specify a cpu that supports VMX for compile-only
tests.
>>> >> -        lappend DEFAULT_VECTCFLAGS "-mcpu=7400"
>>> >> +        lappend DEFAULT_VECTCFLAGS "-mcpu=970"
>>> >>      }
>>> >>      set dg-do-what-default compile
>>> >>  }
>>> >
>>> >
>>>
>>> [attachment "0008-costmodel-early-scalar.diff" deleted by Dorit
>>> Nuzman/Haifa/IBM]
>>
>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0009-costmodel-early-scalar.diff
Type: application/octet-stream
Size: 30092 bytes
Desc: 0009-costmodel-early-scalar.diff
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20071126/94101697/attachment.obj>