This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>> Hi,
>>>>>> As explained in the PR, given below test case:
>>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>>
>>>>>> int
>>>>>> main ()
>>>>>> {
>>>>>> short b;
>>>>>> int i, d;
>>>>>> for (b = 4; b >= 0; b--)
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> a[c + 1][b + 2] = a[c][b + 1];
>>>>>> for (i = 0; i < 8; i++)
>>>>>> for (d = 0; d < 10; d++)
>>>>>> if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>>> __builtin_abort ();
>>>>>> return 0;
>>>>>>
>>>>>> the loop nest is illegal for vectorization without reversing inner loop. The issue
>>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>>>>> exposed this. Previously the vectorization is skipped because of unsupported memory
>>>>>> operation. The outer loop vectorization unrolls the outer loop into:
>>>>>>
>>>>>> for (b = 4; b > 0; b -= 4)
>>>>>> {
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> a[c + 1][6] = a[c][5];
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> a[c + 1][5] = a[c][4];
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> a[c + 1][4] = a[c][3];
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> a[c + 1][3] = a[c][2];
>>>>>> }
>>>>>> Then four inner loops are fused into:
>>>>>> for (b = 4; b > 0; b -= 4)
>>>>>> {
>>>>>> for (c = 0; c <= 6; c++)
>>>>>> {
>>>>>> a[c + 1][6] = a[c][5]; // S1
>>>>>> a[c + 1][5] = a[c][4]; // S2
>>>>>> a[c + 1][4] = a[c][3];
>>>>>> a[c + 1][3] = a[c][2];
>>>>>> }
>>>>>> }
>>>>>
>>>>> Note that they are not really "fused" but they are interleaved. With
>>>>> GIMPLE in mind
>>>>> that makes a difference, you should get the equivalent of
>>>>>
>>>>> for (c = 0; c <= 6; c++)
>>>>> {
>>>>> tem1 = a[c][5];
>>>>> tem2 = a[c][4];
>>>>> tem3 = a[c][3];
>>>>> tem4 = a[c][2];
>>>>> a[c+1][6] = tem1;
>>>>> a[c +1][5] = tem2;
>>>>> a[c+1][4] = tem3;
>>>>> a[c+1][3] = tem4;
>>>>> }
>>>> Yeah, I will double check if this abstract breaks the patch and how.
>>> Hmm, I think this doesn't break it, well at least for part of the
>>> analysis, because it is loop carried (backward) dependence goes wrong,
>>> interleaving or not with the same iteration doesn't matter here.
>>
>> I think the idea is that forward dependences are always fine (negative distance)
>> to vectorize. But with backward dependences we have to adhere to max_vf.
>>
>> It looks like for outer loop vectorization we only look at the distances in the
>> outer loop but never at inner ones. But here the same applies but isn't that
>> independend on the distances with respect to the outer loop?
>>
>> But maybe I'm misunderstanding how "distances" work here.
> Hmm, I am not sure I understand "distance" correctly. With
> description as in book like "Optimizing compilers for Modern
> Architectures", distance is "# of iteration of sink ref - # of
> iteration of source ref". Given below example:
> for (i = 0; i < N; ++i)
> {
> x = arr[idx_1]; // S1
> arr[idx_2] = x; // S2
> }
> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also
> this is forward dependence. For example, idx_1 is i + 1 and idx_2 is
> i;
> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also
> this is backward dependence. For example idx_1 is i and idx_2 is i +
> 1;
>
> In GCC, we always try to subtract idx_2 from idx_1 first in computing
> classic distance, we could result in negative distance in case of
> backward dependence. When this happens at dependence carried level,
> we manually reverse it. When this happens at inner level loop, we
> simply keep the negative distance. And it's meaningless to talk about
> forward/backward given dependence is carried at outer level loop.
>
> Outer loop vectorization is interesting. The problematic case has
> backward dependence carried by outer loop. Because we don't check
> dist vs. max_vf for such dep, the unrolled references could have outer
> loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's
> like we have outer loop index resolved as equal. Now it has
> dependence only if c == c' + 1. I think previous comment on fusion
> still hold for this and we now need to check backward dependence
> between the two reference at inner level loop. I guess it's a bit
> trick to model dependence between a[c][5] and a[c+1][5] using the
> original references and dependence. I think it's okay to do that
> because distance of a[c][5] and a[c+1][5] is what we computed
> previously for the original references at inner level loop.
>
> Not sure if I have missed something important.
Not sure either. unroll-and-jam does
FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
{
/* A distance (a,b) is at worst transformed into (a/N,b) by the
unrolling (factor N), so the transformation is valid if
a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll
factor needs to be limited so that the first condition holds.
That may limit the factor down to zero in the worst case. */
int dist = dist_v[0];
if (dist < 0)
gcc_unreachable ();
else if ((unsigned)dist >= *unroll)
;
else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
|| (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
&& dist > 0))
;
else
*unroll = dist;
where *unroll is similar to *max_vf I think. dist_v[0] is the innermost loop.
The vectorizer does way more complicated things and only looks at the
distance with respect to the outer loop as far as I can see which can
be negative.
Not sure if fusion and vectorizer "interleaving" makes a difference here.
I think the idea was that when interleaving stmt-by-stmt then forward
dependences would be preserved and thus we don't need to check the
inner loop dependences. speaking with "forward vs. backward" dependences
again, not distances...
This also means that unroll-and-jam could be enhanced to "interleave"
stmts and thus cover more cases?
Again, I hope Micha can have a look here...
Richard.
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>>
>>>>>
>>>>>> The loop fusion needs to meet the dependence requirement. Basically, GCC's data
>>>>>> dependence analyzer does not model dep between references in sibling loops, but
>>>>>> in practice, fusion requirement can be checked by analyzing all data references
>>>>>> after fusion, and there is no backward data dependence.
>>>>>>
>>>>>> Apparently, the requirement is violated because we have backward data dependence
>>>>>> between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the inner
>>>>>> loop, the outer loop would become legal for vectorization.
>>>>>>
>>>>>> This patch fixes the issue by enforcing dependence check. It also adds two tests
>>>>>> with one shouldn't be vectorized and the other should. Bootstrap and test on x86_64
>>>>>> and AArch64. Is it OK?
>>>>>
>>>>> I think you have identified the spot where things go wrong but I'm not
>>>>> sure you fix the
>>>>> problem fully. The spot you pacth is (loop is the outer loop):
>>>>>
>>>>> loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>>>> ...
>>>>> FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>>> {
>>>>> int dist = dist_v[loop_depth];
>>>>> ...
>>>>> if (dist > 0 && DDR_REVERSED_P (ddr))
>>>>> {
>>>>> /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>>> reversed (to make distance vector positive), and the actual
>>>>> distance is negative. */
>>>>> if (dump_enabled_p ())
>>>>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>> "dependence distance negative.\n");
>>>>>
>>>>> where you add
>>>>>
>>>>> + /* When doing outer loop vectorization, we need to check if there is
>>>>> + backward dependence at inner loop level if dependence at the outer
>>>>> + loop is reversed. See PR81740 for more information. */
>>>>> + if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>>>> + || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>>>> + {
>>>>> + unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>>>>> + DDR_LOOP_NEST (ddr));
>>>>> + if (dist_v[inner_depth] < 0)
>>>>> + return true;
>>>>> + }
>>>>>
>>>>> but I don't understand how the dependence direction with respect to the
>>>>> outer loop matters here.
>>>> If the direction wrto outer loop is positive by itself, i.e,
>>>> reversed_p equals to false, then dist is checked against max_vf. In
>>>> this case, it's not possible to have references refer to the same
>>>> object?
>>>> On the other hand, dist is not checked at all for reversed case.
>>>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>>>
>>>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>>>> mean for the inner loop distance given the quite non-obvious code handling
>>>>> this case in tree-data-ref.c:
>>>>>
>>>>> /* Verify a basic constraint: classic distance vectors should
>>>>> always be lexicographically positive.
>>>>>
>>>>> Data references are collected in the order of execution of
>>>>> the program, thus for the following loop
>>>>>
>>>>> | for (i = 1; i < 100; i++)
>>>>> | for (j = 1; j < 100; j++)
>>>>> | {
>>>>> | t = T[j+1][i-1]; // A
>>>>> | T[j][i] = t + 2; // B
>>>>> | }
>>>>>
>>>>> references are collected following the direction of the wind:
>>>>> A then B. The data dependence tests are performed also
>>>>> following this order, such that we're looking at the distance
>>>>> separating the elements accessed by A from the elements later
>>>>> accessed by B. But in this example, the distance returned by
>>>>> test_dep (A, B) is lexicographically negative (-1, 1), that
>>>>> means that the access A occurs later than B with respect to
>>>>> the outer loop, ie. we're actually looking upwind. In this
>>>>> case we solve test_dep (B, A) looking downwind to the
>>>>> lexicographically positive solution, that returns the
>>>>> distance vector (1, -1). */
>>>>> if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>>> {
>>>>> lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>>> if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>>> return false;
>>>>> compute_subscript_distance (ddr);
>>>>> if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>>> &index_carry))
>>>>> return false;
>>>>> save_dist_v (ddr, save_v);
>>>>> DDR_REVERSED_P (ddr) = true;
>>>>>
>>>>> that is, what does dist_v[inner_depth] < 0 tell us here? Does it tell us
>>>>> that the dependence direction is reverse of the outer loop?
>>>> Hmm, this part of code is a bit confusing for me. So we always
>>>> compute classic distance by sorting two data references in lexico
>>>> order, which could be the opposite given we collect references by dom
>>>> order. IIUC, the negative dist at inner loop means a backward
>>>> dependence for that index/loop. It's not related to outer loop or the
>>>> reversed_p flag.
>>>>
>>>> Thanks,
>>>> bin
>>>>>
>>>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> bin
>>>>>> 2017-12-15 Bin Cheng <bin.cheng@arm.com>
>>>>>>
>>>>>> PR tree-optimization/81740
>>>>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>>>> of outer loop vectorization, check backward dependence at inner loop
>>>>>> if dependence at outer loop is reversed.
>>>>>>
>>>>>> gcc/testsuite
>>>>>> 2017-12-15 Bin Cheng <bin.cheng@arm.com>
>>>>>>
>>>>>> PR tree-optimization/81740
>>>>>> * gcc.dg/vect/pr81740-1.c: New test.
>>>>>> * gcc.dg/vect/pr81740-2.c: Refine test.