target cost model tuning for x86

Sun Sep 9 20:14:00 GMT 2007

hi Harsha,

thanks for the code examples,

....
> Conservatively, I think it amounts to 1 taken and 1 not-taken branch in
> the epilogue and prologue each. Please correct me if you think if I am
> generalizing this incorrectly.
>
> If you think its ok, then perhaps the guards should be qualified as a
> taken and not-taken. In that case, we can count 2 guards for the
> run-time case for the prologue and epilogue each, but 1 guard will be
> counted as taken with a higher cost and the other guard will be counted
> as not-taken with a lower cost. The not-taken guard still involves a
> compare of some sort and that should be counted.
>

(it also involves other costs that are a result of the fact that branches
may prevent other optimizations that would have otherwise been applicable).
But I agree that it is a good idea to allow to model different costs for
taken and non-taken branches. The SPU is very sensitive to whether the
branch is taken or not (it has no branch prediction and assumes a
'non-taken' behavior by default, so these taken branches will be very
painful there), other targets may be less sensitive. Anyhow, each target
could assign it's own cost.

So overall, I think we agree to count (TAKEN_BRANCH_COST +
NON_TAKEN_BRANCH_COST) per each peel loop, with the exception that if the
number of the peel loop iterations is known at compile time, then at least
one branch will be eliminated; if in addition the number-of-iterations of
the main vector loop is also known, more branches can be eliminated.

In fact, we can maybe do something better. If we look at the guard that
controls whether we enter the vectorized loop or skip to the epilog:
      if (vect_niters > th)
         then fall through to vector-loop
         else skip to scalar epilog-loop
we see that if we enter the vectorized-loop we follow the fall-through path
(non-taken branch),
whereas if we choose to skip the vectorized-loop we take the branch.
So in other words, when we compare the costs of executing the scalar loop
vs. the cost of executing the vector loop, we should add the cost of the
non-taken-branch to the vector-loop side of the equation,
and add the cost of the taken-branch to scalar-loop side of the equation.
And similarly, if we skipped the vector loop we just continue directly to
the epilog loop, whereas if we took the vector-loop path, at the exit from
the vector-loop we have another branch that may or may not be taken. So
overall (and assuming that loop-counts are unknown), instead of just adding
the epilog guard-code costs to the vector side of the equation, we should
rather break it into both sides of the equation, i.e. something roughly
like this:
      scalar_loop_costs + taken_branch_cost >=
      vector_loop_costs + non_taken_branch_cost + average_branch_cost

If we do this, we could eliminate the
targetm.vectorize.builtin_vectorization_cost stuff, cause this was the main
motivation why it was added in the first place - to reflect the fact that
there may be a run-time cost to choosing to skip the vector-loop.

In fact, since the runtime-test makes the decision on whether or not to
execute the vector-loop only after the prolog-loop had already been
executed, the cost of the prolog guard-code should also be added to the
scalar side of the equation (or rather not be added to either side of the
equation). Similarly, the scalar-loop will also suffer from the branch-cost
associated with loop-versioning...

Just to clarify - all of the above is referring to the case that we
generate a run-time test (as opposed to making the decision at compile
time, in which case we can and should consider all the penalties only on
the vector side). This is because of the way we do the run-time test (it's
not an extra branch added before the entire loop (with its versioning and
peeling) but rather reuses and existing branch. Which makes we wonder if
(1) it makes sense to compute two different thresholds - one for the
compile time test and one for the runtime test, as they already compare the
threshold against different values, and (2) if it would be better to place
the runtime test earlier, so that the scalar loop would suffer from less
penalties...

thanks,
dorit

> Thanks,
> Harsha
>
>