This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC] Vectorization of indexed elements
- From: Vidya Praveen <vidyapraveen at arm dot com>
- To: Richard Biener <rguenther at suse dot de>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, "ook at ucw dot cz" <ook at ucw dot cz>
- Date: Tue, 24 Sep 2013 16:03:43 +0100
- Subject: Re: [RFC] Vectorization of indexed elements
- Authentication-results: sourceware.org; auth=none
- References: <20130909172533 dot GA25330 at e103625-lin dot cambridge dot arm dot com> <alpine dot DEB dot 2 dot 10 dot 1309091949090 dot 3565 at laptop-mg dot saclay dot inria dot fr> <alpine dot LNX dot 2 dot 00 dot 1309101019060 dot 3869 at zhemvz dot fhfr dot qr>
On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> On Mon, 9 Sep 2013, Marc Glisse wrote:
>
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> >
> > > Hello,
> > >
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > >
> > > Appreciate your opinion on this.
> > >
> > > ---
> > >
> > > A phrase like this:
> > >
> > > for(i=0;i<4;i++)
> > > a[i] = b[i] <op> c[2];
> > >
> > > is usually vectorized as:
> > >
> > > va:V4SI = a[0:3]
> > > vb:V4SI = b[0:3]
> > > t = c[2]
> > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > ...
> > > va:V4SI = vb:V4SI <op> vc:V4SI
> > >
> > > But this could be simplified further if a target has instructions that
> > > support
> > > indexed element as a parameter. For example an instruction like this:
> > >
> > > mul v0.4s, v1.4s, v2.4s[2]
> > >
> > > can perform multiplication of each element of v2.4s with the third element
> > > of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > >
> > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for <op> through a target hook or
> > > macro)
> > > and consider this as vectorizable statement without having to duplicate the
> > > elements explicitly.
> > >
> > > There are fews ways this could be represented at gimple:
> > >
> > > ...
> > > va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > ...
> > >
> > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > vectorizable statement:
> >
> > Might as well allow any scalar then...
>
> I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> would necessarily be two extra separate statements and thus subject
> to CSE obfuscating it enough for RTL expansion to no longer notice it.
I also thought about having a specialized expression like
VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index>
to mean:
arg0 = arg1 <op> arg2[index]
and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
is handled in expr.c. But I dropped this idea since we may need to introduce
many such nodes.
>
> That said, allowing mixed scalar/vector ops isn't very nice and
> your scheme can be simplified by just using
>
> vc:V4SI = VEC_DUPLICATE_EXPR <...>
> va:V4SI = vb:V4SI <op> vc:V4SI
>
> where the expander only has to see that vc:V4SI is defined by
> a duplicate.
I did try out something like this quickly before I posted this RFC, though
I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
for:
for(i=0;i<8;i++)
a[i] = b[2] * c[i];
I could generate:
...
<bb 8>:
_88 = prolog_loop_adjusted_niters.6_60 * 4;
vectp_c.13_87 = c_10(D) + _88;
vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; <<<<<<<<
vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); <<<<<<<<
_96 = prolog_loop_adjusted_niters.6_60 * 4;
vectp_a.19_95 = a_6(D) + _96;
vect__12.14_115 = MEM[(int *)vectp_c.13_87];
vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; <<<<<<<<
MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; <<<<<<<<
vectp_c.12_118 = vectp_c.13_87 + 16;
vectp_a.18_119 = vectp_a.19_95 + 16;
ivtmp_120 = 1;
if (ivtmp_120 < bnd.8_62)
goto <bb 9>;
else
goto <bb 11>;
<bb 9>:
# vectp_c.12_89 = PHI <vectp_c.12_118(8)>
# vectp_a.18_97 = PHI <vectp_a.18_119(8)>
# ivtmp_14 = PHI <ivtmp_120(8)>
vect__12.14_91 = MEM[(int *)vectp_c.12_89]; <<<<<<<<
vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; <<<<<<<<
MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
...
It's a crude implementation so VEC_DUP is printed as:
(vect_ldidx_.16_92) <<< ??? >>> (0);
> > > ...
> > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > > the
> > > only use of 'c' then it may be safer to just load one element and use it
> > > like
> > > this:
> > >
> > > vc:V4SI[0] = c[2]
> > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > >
> > > for(i=0;i<4;i++)
> > > a[i] = b[i] <op> c
> > >
> > > could be vectorized as:
> > >
> > > vc:V4SI[0] = c
> > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > Such a change would also require new standard pattern names to be defined
> > > for
> > > each <op>.
> > >
> > > Alternatively, having something like this:
> > >
> > > ...
> > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > ...
> > >
> > > would remove the need to introduce several new standard pattern names but
> > > have
> > > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > > expect
> > > the target to have combiner patterns.
> >
> > The cost estimation wouldn't be very good, but aren't combine patterns enough
> > for the whole thing? Don't you model your mul instruction as:
> >
> > (mult:V4SI
> > (match_operand:V4SI)
> > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> >
> > anyway? Seems that combine should be able to handle it. What currently happens
> > that we fail to generate the right instruction?
> >
> > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> > vec_duplicate, adding new nodes is always painful.
>
> True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But yes,
> we have it that way at the moment and indeed adding new nodes is always
> painful.
>
> > > This enhancement could possibly help further optimizing larger scenarios
> > > such
> > > as linear systems.
>
> Given that the vectorizer already handles all cases you quote but
> just the expansion doesn't use the targets special abilities - can't
> you just teach the expander to lookup the definition of the
> vectors and see if it is an uniform CONSTRUCTOR?
>
> Richard.
>
I did think of handling this as a part of expanding CONSTRUCTOR but I thought
it may not a good idea if we want to enhance this support in the future to
handle larger cases like this one (hypothetical example):
for i = 0 to 3
for j = 0 to 3
a[j] += b[j] * c[i]
to
a[0:3] += b[0:3] + c[0]
a[0:3] += b[0:3] + c[1]
a[0:3] += b[0:3] + c[2]
a[0:3] += b[0:3] + c[3]
Secondly, I am not sure if introducing a single lane load at the time of
expansion and removing or expecting the existing scalar load to be removed
later as it is unused, is a good idea.
Please advise.
Cheers
VP