This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
- From: "Bin.Cheng" <amker dot cheng at gmail dot com>
- To: Richard Biener <richard dot guenther at gmail dot com>, "Bin.Cheng" <amker dot cheng at gmail dot com>, Michael Matz <matz at suse dot de>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>, Richard Sandiford <richard dot sandiford at arm dot com>
- Date: Tue, 26 Mar 2019 14:37:08 +0800
- Subject: Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
- References: <DB5PR0801MB27422285F855F93CFAEA210BE70B0@DB5PR0801MB2742.eurprd08.prod.outlook.com> <CAFiYyc2cCRs1u0DGuZ5AGmd-yu6fgXUyGpsZY+2+J7duqVsuYQ@mail.gmail.com> <CAHFci29mdLMipHpeW4djZv+HpPHP0hkphrys9vudjgTxOb0rtA@mail.gmail.com> <CAHFci298TSvGaq1g-jkgDCY3JvOoGfKAsx_+8q3KXogMqz2KGw@mail.gmail.com> <CAFiYyc1wJUZaeLghyyxbbszY6WBStsZdSkt9Pvwu8x5fUbJGhA@mail.gmail.com> <CAHFci2_EhCwPfmp2bK6t59T=zNN_Qttsn9P2gipSwZuQYyRZLg@mail.gmail.com> <CAFiYyc3+QJdcoa3moGRkPTSKh-vSAekBkOkyQzk+vSc0b5O+0Q@mail.gmail.com> <CAFiYyc17bNKNFR9-dBVEyM4qjC61Yvz=AWQw9-EquPmtX0+wtg@mail.gmail.com> <CAHFci280UOxGEx3S26UBUBOOfVVEzR5_tciTFty5p2YD7prO5Q@mail.gmail.com> <CAFiYyc2F6a5QEaP0UFjLrsroj87ouPZxDrbCkNSd9w9Cj=VT7w@mail.gmail.com> <mptpnqelagu.fsf@arm.com>
On Tue, Mar 26, 2019 at 8:56 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.cheng@gmail.com> wrote:
> >>
> >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >> >
> >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> >> > <richard.guenther@gmail.com> wrote:
> >> > >
> >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> >> > > > <richard.guenther@gmail.com> wrote:
> >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> >> > > >>>> <richard.guenther@gmail.com> wrote:
> >> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> >> > > >>>>>> Hi,
> >> > > >>>>>> As explained in the PR, given below test case:
> >> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> >> > > >>>>>>
> >> > > >>>>>> int
> >> > > >>>>>> main ()
> >> > > >>>>>> {
> >> > > >>>>>> short b;
> >> > > >>>>>> int i, d;
> >> > > >>>>>> for (b = 4; b >= 0; b--)
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> a[c + 1][b + 2] = a[c][b + 1];
> >> > > >>>>>> for (i = 0; i < 8; i++)
> >> > > >>>>>> for (d = 0; d < 10; d++)
> >> > > >>>>>> if (a[i][d] != (i == 3 && d == 6) * 4)
> >> > > >>>>>> __builtin_abort ();
> >> > > >>>>>> return 0;
> >> > > >>>>>>
> >> > > >>>>>> the loop nest is illegal for vectorization without reversing inner loop. The issue
> >> > > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> >> > > >>>>>> exposed this. Previously the vectorization is skipped because of unsupported memory
> >> > > >>>>>> operation. The outer loop vectorization unrolls the outer loop into:
> >> > > >>>>>>
> >> > > >>>>>> for (b = 4; b > 0; b -= 4)
> >> > > >>>>>> {
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> a[c + 1][6] = a[c][5];
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> a[c + 1][5] = a[c][4];
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> a[c + 1][4] = a[c][3];
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> a[c + 1][3] = a[c][2];
> >> > > >>>>>> }
> >> > > >>>>>> Then four inner loops are fused into:
> >> > > >>>>>> for (b = 4; b > 0; b -= 4)
> >> > > >>>>>> {
> >> > > >>>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>>> {
> >> > > >>>>>> a[c + 1][6] = a[c][5]; // S1
> >> > > >>>>>> a[c + 1][5] = a[c][4]; // S2
> >> > > >>>>>> a[c + 1][4] = a[c][3];
> >> > > >>>>>> a[c + 1][3] = a[c][2];
> >> > > >>>>>> }
> >> > > >>>>>> }
> >> > > >>>>>
> >> > > >>>>> Note that they are not really "fused" but they are interleaved. With
> >> > > >>>>> GIMPLE in mind
> >> > > >>>>> that makes a difference, you should get the equivalent of
> >> > > >>>>>
> >> > > >>>>> for (c = 0; c <= 6; c++)
> >> > > >>>>> {
> >> > > >>>>> tem1 = a[c][5];
> >> > > >>>>> tem2 = a[c][4];
> >> > > >>>>> tem3 = a[c][3];
> >> > > >>>>> tem4 = a[c][2];
> >> > > >>>>> a[c+1][6] = tem1;
> >> > > >>>>> a[c +1][5] = tem2;
> >> > > >>>>> a[c+1][4] = tem3;
> >> > > >>>>> a[c+1][3] = tem4;
> >> > > >>>>> }
> >> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> >> > > >>> Hmm, I think this doesn't break it, well at least for part of the
> >> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> >> > > >>> interleaving or not with the same iteration doesn't matter here.
> >> > > >>
> >> > > >> I think the idea is that forward dependences are always fine (negative distance)
> >> > > >> to vectorize. But with backward dependences we have to adhere to max_vf.
> >> > > >>
> >> > > >> It looks like for outer loop vectorization we only look at the distances in the
> >> > > >> outer loop but never at inner ones. But here the same applies but isn't that
> >> > > >> independend on the distances with respect to the outer loop?
> >> > > >>
> >> > > >> But maybe I'm misunderstanding how "distances" work here.
> >> > > > Hmm, I am not sure I understand "distance" correctly. With
> >> > > > description as in book like "Optimizing compilers for Modern
> >> > > > Architectures", distance is "# of iteration of sink ref - # of
> >> > > > iteration of source ref". Given below example:
> >> > > > for (i = 0; i < N; ++i)
> >> > > > {
> >> > > > x = arr[idx_1]; // S1
> >> > > > arr[idx_2] = x; // S2
> >> > > > }
> >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also
> >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is
> >> > > > i;
> >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also
> >> > > > this is backward dependence. For example idx_1 is i and idx_2 is i +
> >> > > > 1;
> >> > > >
> >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> >> > > > classic distance, we could result in negative distance in case of
> >> > > > backward dependence. When this happens at dependence carried level,
> >> > > > we manually reverse it. When this happens at inner level loop, we
> >> > > > simply keep the negative distance. And it's meaningless to talk about
> >> > > > forward/backward given dependence is carried at outer level loop.
> >> > > >
> >> > > > Outer loop vectorization is interesting. The problematic case has
> >> > > > backward dependence carried by outer loop. Because we don't check
> >> > > > dist vs. max_vf for such dep, the unrolled references could have outer
> >> > > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's
> >> > > > like we have outer loop index resolved as equal. Now it has
> >> > > > dependence only if c == c' + 1. I think previous comment on fusion
> >> > > > still hold for this and we now need to check backward dependence
> >> > > > between the two reference at inner level loop. I guess it's a bit
> >> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
> >> > > > original references and dependence. I think it's okay to do that
> >> > > > because distance of a[c][5] and a[c+1][5] is what we computed
> >> > > > previously for the original references at inner level loop.
> >> > > >
> >> > > > Not sure if I have missed something important.
> >> > >
> >> > > Not sure either. unroll-and-jam does
> >> > >
> >> > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> >> > > {
> >> > > /* A distance (a,b) is at worst transformed into (a/N,b) by the
> >> > > unrolling (factor N), so the transformation is valid if
> >> > > a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll
> >> > > factor needs to be limited so that the first condition holds.
> >> > > That may limit the factor down to zero in the worst case. */
> >> > > int dist = dist_v[0];
> >> > > if (dist < 0)
> >> > > gcc_unreachable ();
> >> > > else if ((unsigned)dist >= *unroll)
> >> > > ;
> >> > > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > > && dist > 0))
> >> > > ;
> >> > > else
> >> > > *unroll = dist;
> >> > >
> >> > > where *unroll is similar to *max_vf I think. dist_v[0] is the innermost loop.
> >> > >
> >> > > The vectorizer does way more complicated things and only looks at the
> >> > > distance with respect to the outer loop as far as I can see which can
> >> > > be negative.
> >> > >
> >> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> >> > > I think the idea was that when interleaving stmt-by-stmt then forward
> >> > > dependences would be preserved and thus we don't need to check the
> >> > > inner loop dependences. speaking with "forward vs. backward" dependences
> >> > > again, not distances...
> >> > >
> >> > > This also means that unroll-and-jam could be enhanced to "interleave"
> >> > > stmts and thus cover more cases?
> >> > >
> >> > > Again, I hope Micha can have a look here...
> >> >
> >> > Coming back to this... the vectorizer doesn't look at the inner loop distances
> >> > at all at the moment. Bins orginal patch adds that for the case where the
> >> > outer loop evolution results in a negative dependence distance. But what
> >> > if the outer loop evolution results in a zero distance or a positive distance
> >> > within the bounds of a sensible max_vf? So I think we need to "simply"
> >> > look at all distance vector components, not just that of the outer loop
> >> > and perform all regular checks. That also allows adjustment of max_vf
> >> > for inner loop dependences in case vectorization with smaller vectors is
> >> > still possible.
> >> >
> >> > Doing that by making loop_depth iterate
> >> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> >> > if we make the code look at ->safelen of the inner loop when processing
> >> > inner loop distances (because safelen is 0 there). Just using ->safelen
> >> > from the outer loop doesn't regress anything it seems.
> >> >
> >> > So I am going to test the following attached. I have added the
> >> > testcase also with reversed inner loop to verify we'd apply outer loop
> >> > vectorization to that one.
> >> >
> >> > Any comments?
> >> I kind of think that we are checking for different things in this way.
> >> IIUC, dependence checks in outer loop vectorizer could be categorized
> >> into three parts. The regular vect related check on outer loop;
> >> dependence check to fuse inner loop iterations; and dependence check
> >> to shuffle scalar statement/reference together. To pass the fusion
> >> check, we only need to make sure no backward dependence is generated
> >> in the result. So for case of zero/positive distance in outer loop
> >> evolution, it won't result in backward dependence? For example, if we
> >> adjust the original loop like:
> >>
> >> for (b = 4; b >= 0; b--)
> >> for (c = 0; c <= 6; c++)
> >> - a[c + 1][b + 2] = a[c][b + 1];
> >> + a[c + 1][b + 1] = a[c][b + 2];
> >>
> >> The fusion result would be
> >> for (b = 4; b > 0; b -= 4)
> >> {
> >> for (c = 0; c <= 6; c++)
> >> {
> >> a[c + 1][5] = a[c][6]; // S1
> >> a[c + 1][4] = a[c][5]; // S2
> >> a[c + 1][3] = a[c][4];
> >> a[c + 1][2] = a[c][3];
> >> }
> >> }
> >> Though there is dependence between s1/s2, it's forward dependence now.
> >
> > Hmm, OK. But this doesn't vectorize with or without the patch, we claim
> >
> > t.c:9:3: note: dependence distance = 1.
> > t.c:11:29: missed: not vectorized, possible dependence between
> > data-refs a[c.3_15][_48] and a[_3][_50]
> > t.c:9:3: missed: bad data dependence.
> > t.c:9:3: missed: couldn't vectorize loop
> >
> >> Even we need to check the negative distance for inner loop, using the
> >> regular checks for fusion part for inner loop(s) is a bit strange.
> >> Using outer loop's safelen here is also unclear.
> >>
> >> I am not sure where we do the shuffle related check, is it (dist == 0) case?
> >> Is that the reason that the patch employs the whole regular checks?
> >
> > I think the suffling needs to look at the outer loop distance and it is the
> > == 0 and the forward dependence case where we check/constrain
> > against the vectorization factor?
> >
> > I agree that formulating the outer loop dependence testing in a way to
> > check the unroll-and-jam condition and the shuffling part separately
> > would be best. I was mainly worried that your patch only hooks into
> > the negative "outer" distance case and not the zero or positive
> > distance one. For example for unroll-and-jam we constrain the
> > maximum unroll factor. I wonder if we should simply call into
> > its adjust_unroll_factor from the vectorizer or copy&paste it
> > to the vectorizer. I'll note that with known dependence distances
> > it seems to never compute failure... at least it computes the
> > correct maximum vectorization factor for me.
> >
> > I also wonder about dependences for DRs in the outer loop
> > which we'd even more heavily re-order.
> >
> > Anyways, we should somehow fix this.
> >
> > Attached is a patch variant cut&pasting the unroll-and-jam
> > testing.
> >
> > Does this look better?
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> bin
> >> >
> >> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> >> >
> >> > Richard.
> >> >
> >> > 2019-03-21 Richard Biener <rguenther@suse.de>
> >> >
> >> > PR tree-optimization/81740
> >> > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >> > Perform distance vector related dependence checks for all
> >> > loops.
> >> >
> >> > * gcc.dg/vect/pr81740-1.c: New testcase.
> >> > * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > 2019-03-25 Richard Biener <rguenther@suse.de>
> >
> > PR tree-optimization/81740
> > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> > Add explicit unroll-and-jam dependence testing.
> >
> > * gcc.dg/vect/pr81740-1.c: New testcase.
> > * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > Index: gcc/tree-vect-data-refs.c
> > ===================================================================
> > --- gcc/tree-vect-data-refs.c (revision 269908)
> > +++ gcc/tree-vect-data-refs.c (working copy)
> > @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct
> > }
> >
> > loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> > + gcc_assert (loop_depth == 0);
> > +
> > + /* Perform unroll-and-jam test. */
> > + if (nested_in_vect_loop_p (loop, stmtinfo_a))
> > + {
> > + FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> > + {
> > + /* A distance (a,b) is at worst transformed into (a/N,b) by the
> > + unrolling (factor N), so the transformation is valid if
> > + a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll
> > + factor needs to be limited so that the first condition holds.
> > + That may limit the factor down to zero in the worst case. */
>
> I think it would be good to say that we're applying a two-stage
> check here (unroll-and-jam, then statement-level interleaving). As it
> stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove
> that outer-loop vectorisation is valid for any VF, whereas that isn't
> true for the interleaving part. That said...
>
> > + int dist = dist_v[0];
> > + if (dist < 0)
> > + gcc_unreachable ();
> > + else if ((unsigned)dist >= *max_vf)
> > + ;
> > + else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > + && dist > 0))
> > + ;
> > + else if (dist <= 1)
> > + return opt_result::failure_at (stmtinfo_a->stmt,
> > + "not vectorized, possible dependence"
> > + " between data-refs %T and %T\n",
> > + DR_REF (dra), DR_REF (drb));
> > + else
> > + {
> > + /* Dependence distance does not create dependence, as far as
> > + vectorization is concerned, in this case. */
> > + if (dump_enabled_p ())
> > + dump_printf_loc (MSG_NOTE, vect_location,
> > + "dependence between data-refs %T and %T "
> > + "constrains vectorization factor to %d\n",
> > + DR_REF (dra), DR_REF (drb), dist);
> > + *max_vf = dist;
> > + }
> > + }
> > + }
>
> ...I guess this is looping back to the original discussion, sorry, but I'm
> not sure taking it as a two-stage check really helps. We're going from:
>
> A:
> for i in [0:n]:
> for j in [0:n]:
> A(i,j)
> B(i,j)
>
> to:
>
> B:
> for i in [0:n:2]:
> for j in [0:n]:
> A(i,j)
> B(i,j)
> for j in [0:n]:
> A(i+1,j)
> B(i+1,j)
>
> to:
>
> C:
> for i in [0:n:2]:
> for j in [0:n]:
> A(i,j)
> B(i,j)
> A(i+1,j)
> B(i+1,j)
>
> to:
>
> D:
> for i in [0:n:2]:
> for j in [0:n]:
> A(i,j)
> A(i+1,j)
> B(i,j)
> B(i+1,j)
>
> but since outer loop vectorisation doesn't change the order of
> statements for each "i", IMO it's clearer to go directly from B to D
> by treating the inner loop as though we were completely unrolling it.
> C doesn't seem like a useful midway point.
>
> I think the only cases the patch is handling on top of existing checks are:
>
> (a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag.
> (b) a==0 && b==0, which we now reject earlier
>
> I don't think (b) makes any difference in practice because
> (i) we already reject accesses with a zero step and (ii)
> without the support for runtime aliasing checks for outer loops,
> we're not going to be able to add any extra disambiguation for
> things like:
>
> d[i][j] = ...;
> ... = d[i][j];
>
> relative to other loop accesses, so there won't be any cases we can
> handle here that wouldn't have been propagated earlier.
>
> So I think we're really left with (a) being the useful case,
> which is also the one added by Bin's original patch.
I felt that my patch is enough to fix the issue, but didn't explicitly
mention because I failed to reason about it case by case. If it can
be somehow proven, I would suggest first apply it now and
revisit/refactor vect dependence check in the light of your comments
in Stage 1. BTW, I am not fully following the unrolling view. Seems
to me unroll doesn't change order, but there is possible reordering
here.
Thanks,
bin
>
> Based on the "complete unrolling" view, if we number statements as
> (i, n), where i is the outer loop iteration and n is a statement number
> in the (completely unrolled) loop body, then the original scalar code
> executes in lexicographical order while for the vector loop:
>
> (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
> (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
> (well, nx unrestricted, but only nx>=0 is useful given (1))
>
> So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
> we need to restrict VF to ix so that (2) ensures the right order.
> This means that the unnormalised distances of interest are:
>
> - (ix, -nx), ix>=1, nx>=0
> - (-ix, nx), ix>=1, nx>=0
>
> But the second gets normalised to the first, which is actually useful
> in this case :-).
>
> In terms of the existing code, I think that means we want to change
> the handling of nested statements (only) to:
>
> - ignore DDR_REVERSED_P (ddr)
> - restrict the main dist > 0 case to when the inner distance is <= 0.
>
> This should have the side effect of allowing outer-loop vectorisation for:
>
> void __attribute__ ((noipa))
> f (int a[][N], int b[restrict])
> {
> for (int i = N - 1; i-- > 0; )
> for (int j = 0; j < N - 1; ++j)
> a[j + 1][i] = a[j][i + 1] + b[i];
> }
>
> At the moment we reject this, but AFAICT it should be OK.
> (We do allow it for s/i + 1/i/, since then the outer distance is 0.)
>
> I think we can also skip the existing dist==0 code for nested statements
> since any case that really matters (such as zero steps) should also have
> a second dependence distance (1, 0) that we'd handle as above. Doing that
> is probably something for stage 1 though.
>
> Hope at least some of that was vaguely right. ;-)
>
> Thanks,
> Richard