This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC] Vectorization of indexed elements
- From: Richard Biener <rguenther at suse dot de>
- To: Vidya Praveen <vidyapraveen at arm dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, "ook at ucw dot cz" <ook at ucw dot cz>
- Date: Tue, 1 Oct 2013 10:26:25 +0200 (CEST)
- Subject: Re: [RFC] Vectorization of indexed elements
On Mon, 30 Sep 2013, Vidya Praveen wrote:
> On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> >
> > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > [...]
> > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > >
> > > > > > > vc:V4SI[0] = c
> > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > >
> > > > > > > Or is there any other way to do this?
> > > > > >
> > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > What's the single lane load in your example?
> > > > >
> > > > > Loading just one lane of the vector like this:
> > > > >
> > > > > vc:V4SI[0] = c // from the above scalar example
> > > > >
> > > > > or
> > > > >
> > > > > vc:V4SI[0] = c[2]
> > > > >
> > > > > is what I meant by single lane load. In this example:
> > > > >
> > > > > t = c[2]
> > > > > ...
> > > > > vb:v4si = b[0:3]
> > > > > vc:v4si = { t, t, t, t }
> > > > > va:v4si = vb:v4si <op> vc:v4si
> > > > >
> > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > insist that 't' be a vector and that t = c[2] be vect_t[0] = c[2] (which could be
> > > > > > seen as vec_select:SI (vect_t 0) ).
> > > > >
> > > > > > I'd expect the instruction
> > > > > > > pattern as quoted to just work (and I hope we expand a uniform
> > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > >
> > > > > > As far as I could tell from the code, this is only done using vec_init. It is
> > > > > > not expanded as vec_duplicate from, for example, store_constructor() in expr.c.
> > > >
> > > > > Do you see any issues if we expand such a constructor as vec_duplicate directly
> > > > > instead of going through the vec_init path?
> > >
> > > Sorry, that was a bad question.
> > >
> > > But here's what I would like to propose as a first step. Please tell me if this
> > > is acceptable or if it makes sense:
> > >
> > > - Introduce standard pattern names
> > >
> > > > "vmulim4" - vector multiply with the second operand as an indexed operand
> > >
> > > Example:
> > >
> > > (define_insn "vmuliv4si4"
> > > > [(set (match_operand:V4SI 0 "register_operand")
> > > > (mult:V4SI (match_operand:V4SI 1 "register_operand")
> > > > (vec_duplicate:V4SI
> > > > (vec_select:SI
> > > > (match_operand:V4SI 2 "register_operand")
> > > > (match_operand:V4SI 3 "immediate_operand")))))]
> > > ...
> > > )
> >
> > We could factor this with providing a standard pattern name for
> >
> > (define_insn "vdupi<mode>"
> > > [(set (match_operand:<mode> 0 "register_operand")
> > > (vec_duplicate:<mode>
> > > (vec_select:<scalarmode>
> > > (match_operand:<mode> 1 "register_operand")
> > > (match_operand:SI 2 "immediate_operand"))))]
>
> This is good. I did think about this but then I thought of avoiding the need
> for combiner patterns :-)
>
> But do you find the lane specific mov pattern I proposed, acceptable?

The specific mul pattern? As said, consider factoring to vdupi to
avoid an explosion in required special optabs.

> > (you use V4SI for the immediate?
>
> Sorry typo again!! It should've been SI.
>
> > Ideally vdupi has another custom
> > mode for the vector index).
> >
> > Note that this factored pattern is already available as vec_perm_const!
> > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> >
> > Which means that on the GIMPLE level we should try to combine
> >
> > el_4 = BIT_FIELD_REF <v_3, ...>;
> > v_5 = { el_4, el_4, ... };
>
> I don't think we reach this state at all for the scenarios in discussion.
> What we generally have is:
>
> el_4 = MEM_REF < array + index*size >
> v_5 = { el_4, ... }
>
> Or am I missing something?

Well, but in that case I doubt it is profitable (or even valid!) to
turn this into a vector lane load from the array. If it is profitable
to perform a vector read (because we're going to use the other elements
of the vector as well) then the vectorizer should produce a vector
load and materialize the uniform vector from one of its elements.

Maybe at this point you should show us a compilable C testcase
with a loop that should be vectorized using your instructions in
the end?

Thanks,
Richard.
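
A minimal sketch of the kind of compilable C testcase asked for above,
offered as an illustration and not as the testcase from this thread.
mul_indexed_scalar is the loop form under discussion (a[i] = b[i] <op> c[2],
with <op> taken to be multiplication). mul_indexed_vector spells out the
"vector load, then materialize the uniform vector from one element" shape
using the GCC vector extension. The function names, the trip count N and the
helper dup_lane are hypothetical. The whole-vector read of c assumes c has
at least four elements, which is exactly the validity concern raised above.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* hypothetical trip count, assumed to be a multiple of 4 */

typedef int v4si __attribute__ ((vector_size (16)));

/* Scalar form: every iteration multiplies by the same element c[2].  */
void
mul_indexed_scalar (int *a, const int *b, const int *c)
{
  for (int i = 0; i < N; i++)
    a[i] = b[i] * c[2];
}

/* Broadcast one lane of V to all four lanes.  This mirrors
     el_4 = BIT_FIELD_REF <v_3, ...>;
     v_5 = { el_4, el_4, el_4, el_4 };
   i.e. a vec_duplicate of a vec_select.  */
static v4si
dup_lane (v4si v, int lane)
{
  assert (lane >= 0 && lane < 4);
  int el = v[lane];
  v4si r = { el, el, el, el };
  return r;
}

/* Hand-vectorized form.  Assumes c[0..3] is readable as a whole.  */
void
mul_indexed_vector (int *a, const int *b, const int *c)
{
  v4si vc;
  memcpy (&vc, c, sizeof vc);        /* vector load of c[0..3] */
  v4si vt = dup_lane (vc, 2);        /* uniform vector from lane 2 */

  for (int i = 0; i < N; i += 4)
    {
      v4si vb;
      memcpy (&vb, b + i, sizeof vb);  /* vb = b[i:i+3] */
      v4si va = vb * vt;               /* va = vb <op> vt */
      memcpy (a + i, &va, sizeof va);
    }
}
```

Whether a given target turns the multiply and the lane broadcast into a
single by-element instruction depends on its patterns; the sketch only shows
the source-level shape the two functions are meant to share.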