target cost model tuning for x86

Mon Sep 10 04:51:00 GMT 2007

Hi Dorit,

>> Conservatively, I think it amounts to 1 taken and 1 not-taken branch
in
>> the epilogue and prologue each. Please correct me if you think if I
am
>> generalizing this incorrectly.
>>
>> If you think its ok, then perhaps the guards should be qualified as a
>> taken and not-taken. In that case, we can count 2 guards for the
>> run-time case for the prologue and epilogue each, but 1 guard will be
>> counted as taken with a higher cost and the other guard will be
counted
>> as not-taken with a lower cost. The not-taken guard still involves a
>> compare of some sort and that should be counted.
>>
>
>(it also involves other costs that are a result of the fact that
branches
>may prevent other optimizations that would have otherwise been
applicable).
>But I agree that it is a good idea to allow to model different costs
for
>taken and non-taken branches. The SPU is very sensitive to whether the
>branch is taken or not (it has no branch prediction and assumes a
>'non-taken' behavior by default, so these taken branches will be very
>painful there), other targets may be less sensitive. Anyhow, each
target
>could assign it's own cost.
>
>So overall, I think we agree to count (TAKEN_BRANCH_COST +
>NON_TAKEN_BRANCH_COST) per each peel loop, with the exception that if
the
>number of the peel loop iterations is known at compile time, then at
least
>one branch will be eliminated; if in addition the number-of-iterations
of
>the main vector loop is also known, more branches can be eliminated.
>
>In fact, we can maybe do something better. If we look at the guard that
>controls whether we enter the vectorized loop or skip to the epilog:
>      if (vect_niters > th)
>         then fall through to vector-loop
>         else skip to scalar epilog-loop
>we see that if we enter the vectorized-loop we follow the fall-through
path
>(non-taken branch),
>whereas if we choose to skip the vectorized-loop we take the branch.
>So in other words, when we compare the costs of executing the scalar
loop
>vs. the cost of executing the vector loop, we should add the cost of
the
>non-taken-branch to the vector-loop side of the equation,
>and add the cost of the taken-branch to scalar-loop side of the
equation.
>And similarly, if we skipped the vector loop we just continue directly
to
>the epilog loop, whereas if we took the vector-loop path, at the exit
from
>the vector-loop we have another branch that may or may not be taken. So
>overall (and assuming that loop-counts are unknown), instead of just
adding
>the epilog guard-code costs to the vector side of the equation, we
should
>rather break it into both sides of the equation, i.e. something roughly
>like this:
>      scalar_loop_costs + taken_branch_cost >=
>      vector_loop_costs + non_taken_branch_cost + average_branch_cost
>
>If we do this, we could eliminate the
>targetm.vectorize.builtin_vectorization_cost stuff, cause this was the
main
>motivation why it was added in the first place - to reflect the fact
that
>there may be a run-time cost to choosing to skip the vector-loop.
>
>In fact, since the runtime-test makes the decision on whether or not to
>execute the vector-loop only after the prolog-loop had already been
>executed, the cost of the prolog guard-code should also be added to the
>scalar side of the equation (or rather not be added to either side of
the
>equation). Similarly, the scalar-loop will also suffer from the
branch-cost
>associated with loop-versioning...
>
>Just to clarify - all of the above is referring to the case that we
>generate a run-time test (as opposed to making the decision at compile
>time, in which case we can and should consider all the penalties only
on
>the vector side). This is because of the way we do the run-time test
(it's
>not an extra branch added before the entire loop (with its versioning
and
>peeling) but rather reuses and existing branch. Which makes we wonder
if
>(1) it makes sense to compute two different thresholds - one for the
>compile time test and one for the runtime test, as they already compare
the
>threshold against different values, and (2) if it would be better to
place
>the runtime test earlier, so that the scalar loop would suffer from
less
>penalties...

We could place the runtime test earlier, so that the scalar loop would
suffer from less penalties. 

In the linpk example I had mentioned earlier, almost 99% of the time is
spent in the following code:

    Line
                m = MOD(N,4)
    323                IF ( m.NE.0 ) THEN
    324                   DO i = 1 , m
    325                      Dy(i) = Dy(i) + Da*Dx(i)
    326                   ENDDO
    327                   IF ( N.LT.4 ) RETURN
    328                ENDIF
    329                mp1 = m + 1
    330                DO i = mp1 , N , 4
    331                   Dy(i) = Dy(i) + Da*Dx(i)
    332                   Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
    333                   Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
    334                   Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
    335                ENDDO 

The loop at line 324 is the run-time example I had given earlier.
(FWIW, the loop at line 330 currently is not vectorized, but should get
vectorized with the SLP patches and should speed up this benchmark
considerably)

The loop at line 324 makes a small difference to the run time of this
benchmark (measured over several runs). The cost model picks the scalar
version correctly for performance and emits the prologue and epilogue
(skipping the vector part).

		NV	V-cost
seconds	18.90	18.94
oprofile	476	28256
samples

I have seen this happen in other benchmarks too, where the vectorized
version with cost takes a little longer than the scalar version, but not
as long as the vector version.

If we go with it, I suppose this could be a new guard [if
scalar_loop_iters <= th] and we could go either to the epilogue or
prologue. Or we could emit the whole original scalar loop body as an
alternative code path, but then there might be some code bloat. I can
get started with a patch for this, thoughts?

If we go with the early seperate test, the penalties need only be
considered on the vector side even for the run-time case and we will be
able to keep the target.m.builtin.vectorization cost because there will
be atleast one new taken branch for the run time scalar loop. Also both
the compile and run time cases can just use the threshold based on
scalar loop iterations and the cost model equation can remain in terms
of scalar loop iterations. And only the guard costs need to be split
into taken, not taken and considered differently as you suggest for
when: 
- neither alignment is known nor iterations
- alignment is known but iterations are not
- both alignment and iterations are known.

Let me know what you think. I can submit the patch with the changes you
recommended before end of stage 2 tomorrow, except what we decide to do
for the above discussion. Given Mark's response, the fix for the above
discussion can be a part of stage 3 since it really is a bug fix of some
sort, which could likely bring some improvement to the run time cases. 

Once we are done with the vectorizer changes, the backend target costs
may also need to change a bit. But these would be fairly non intrusive
and we put them in before stage 2 closes and tweak them as needed in
stage 3.

Thanks,
Harsha