[PATCH, PR52252] Vectorization for load/store groups of size 3.

Tue Feb 11 15:20:00 GMT 2014

On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:

> The missing patch is attached as plain text.
> 
> I have copyright assignment on file with the FSF covering work on GCC.
> 
> Load/store groups of length 3 are the most frequent non-power-of-2
> case. They are used in RGB image processing (like the test case in
> PR52252). We could certainly extend the patch to length 5 and more.
> However, this could affect performance on some other architectures
> and would require more extensive testing. So length 3 is just a first
> step. The algorithm in the patch could be extended to the general case
> in several steps.
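For illustration, a load group of 3 can be de-interleaved with two two-input permutations per output vector. The sketch below models 4-lane int vectors with a plain C struct and a hypothetical `vec_perm` helper that follows VEC_PERM_EXPR / `__builtin_shuffle` semantics (lane i of the result is lane `sel[i]` of the concatenation of both inputs). The masks are worked out by hand for this layout and are not taken from the patch.

```c
#include <assert.h>

/* 4 x int models one 128-bit vector.  */
#define NLANES 4

typedef struct { int v[NLANES]; } vec4;

/* Hypothetical helper modelling a constant two-input permutation with
   VEC_PERM_EXPR / __builtin_shuffle semantics: lane i of the result is
   lane SEL[i] of the 2*NLANES-lane concatenation of A and B.  */
static vec4
vec_perm (vec4 a, vec4 b, const int sel[NLANES])
{
  vec4 r;
  for (int i = 0; i < NLANES; i++)
    {
      assert (sel[i] >= 0 && sel[i] < 2 * NLANES);
      r.v[i] = sel[i] < NLANES ? a.v[sel[i]] : b.v[sel[i] - NLANES];
    }
  return r;
}

/* De-interleave a load group of size 3.  The inputs hold
   r0 g0 b0 r1 | g1 b1 r2 g2 | b2 r3 g3 b3; the outputs are R, G and B,
   each built with two permutations.  */
static void
deinterleave3 (vec4 in0, vec4 in1, vec4 in2, vec4 *r, vec4 *g, vec4 *b)
{
  /* Step 1 gathers the lanes found in IN0 and IN1; don't-care lanes,
     overwritten in step 2, use index 0.  */
  static const int r_lo[NLANES] = { 0, 3, 6, 0 };
  static const int g_lo[NLANES] = { 1, 4, 7, 0 };
  static const int b_lo[NLANES] = { 2, 5, 0, 0 };
  /* Step 2 keeps the gathered lanes and fills the rest from IN2
     (indices 4..7).  */
  static const int r_hi[NLANES] = { 0, 1, 2, 5 };
  static const int g_hi[NLANES] = { 0, 1, 2, 6 };
  static const int b_hi[NLANES] = { 0, 1, 4, 7 };

  *r = vec_perm (vec_perm (in0, in1, r_lo), in2, r_hi);
  *g = vec_perm (vec_perm (in0, in1, g_lo), in2, g_hi);
  *b = vec_perm (vec_perm (in0, in1, b_lo), in2, b_hi);
}
```

With 4 lanes this costs six permutations for the whole group; a real target would prefer masks that its shuffle instructions expand cheaply.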
> 
> I understand that the patch should wait for stage 1; however, since
> it's ready, we can discuss it now and make some changes (such as
> supporting a general group size).

Other than that, I'd like to see a vectorizer hook that queries the cost
of a vec_perm_const expansion instead of adding vec_perm_shuffle (this
requires passing the constant shuffle mask as well as the vector type).
That's more useful for other uses that would require (arbitrary)
shuffles.

I haven't looked at the rest of the patch yet; it's queued in my review
pipeline.

Thanks,
Richard.

> Thanks,
> Evgeny
> 
> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
> >
> > > Hi,
> > >
> > > The patch gives the expected 3x gain for the test case in PR52252
> > > (and even 6x with AVX2).
> > > It passes make check and bootstrap on x86.
> > > SPEC2000/SPEC2006 show no regressions or gains on x86.
> > >
> > > Is this patch ok?
> >
> > I've worked on generalizing the permutation support in light of the
> > generic shuffle support now available in the IL, but hit some
> > roadblocks in the way code generation works for group loads with
> > permutations (I don't remember whether I posted all the patches).
> >
> > This patch seems to go in a slightly different direction, but it
> > again special-cases a specific permutation.  Why's that?  Why can't
> > we support groups of size 7, for example?  So, can this be generalized
> > to support arbitrary non-power-of-two load/store groups?
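One naive way to handle arbitrary group sizes: with group size G and N lanes, element `i*G + k` of the interleaved stream sits in input vector `(i*G + k) / N`, lane `(i*G + k) % N`, so each group member can be gathered by folding in one input vector per two-input permutation. The sketch below assumes that scheme (G - 1 permutations per output vector, G*(G - 1) in total); `vec_perm` and `extract_group_member` are hypothetical helpers, not the vectorizer's code, and the scheme is not claimed to be optimal.

```c
#include <assert.h>

/* Upper bound on lanes per vector in this model.  */
#define MAX_LANES 16

/* Hypothetical helper with VEC_PERM_EXPR semantics on NLANES-lane int
   vectors: lane i of OUT is lane SEL[i] of the concatenation of A and B.
   OUT may alias A or B.  */
static void
vec_perm (int nlanes, const int *a, const int *b, const int *sel, int *out)
{
  int tmp[MAX_LANES];
  for (int i = 0; i < nlanes; i++)
    {
      assert (sel[i] >= 0 && sel[i] < 2 * nlanes);
      tmp[i] = sel[i] < nlanes ? a[sel[i]] : b[sel[i] - nlanes];
    }
  for (int i = 0; i < nlanes; i++)
    out[i] = tmp[i];
}

/* Gather member K of a load group of size GROUP.  IN[v] holds stream
   elements v*NLANES .. v*NLANES + NLANES - 1.  Output lane i needs stream
   element s = i*GROUP + K, found in IN[s / NLANES] at lane s % NLANES.  */
static void
extract_group_member (int nlanes, int group, int in[][MAX_LANES], int k,
                      int *out)
{
  int sel[MAX_LANES];

  assert (nlanes > 0 && nlanes <= MAX_LANES);
  assert (group >= 2 && k >= 0 && k < group);

  /* First permutation: lanes coming from IN[0] or IN[1]; the rest are
     don't-care (index 0) and get filled later.  */
  for (int i = 0; i < nlanes; i++)
    {
      int s = i * group + k;
      int v = s / nlanes, lane = s % nlanes;
      sel[i] = v == 0 ? lane : v == 1 ? nlanes + lane : 0;
    }
  vec_perm (nlanes, in[0], in[1], sel, out);

  /* Each further permutation keeps the lanes already filled (index i)
     and takes the lanes that come from IN[v] (index NLANES + lane).  */
  for (int v = 2; v < group; v++)
    {
      for (int i = 0; i < nlanes; i++)
        {
          int s = i * group + k;
          sel[i] = s / nlanes == v ? nlanes + s % nlanes : i;
        }
      vec_perm (nlanes, out, in[v], sel, out);
    }
}
```

When N is small relative to G (for example G = 7, N = 4), several of these masks are identities, so a real implementation would skip them and search for cheaper mask sequences.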
> >
> > Other than that, the patch has to wait for stage1 to open again,
> > of course.  And it is missing a testcase.
> >
> > Btw, do you have a copyright assignment on file with the FSF covering
> > work on GCC?
> >
> > Thanks,
> > Richard.
> >
> > > ChangeLog:
> > >
> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
> > >
> > >         * target.h (vect_cost_for_stmt): Define new cost vec_perm_shuffle.
> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
> > >         check for store groups of length 3.
> > >         (vect_permute_store_chain): New permutations for store groups of
> > >         length 3.
> > >         (vect_grouped_load_supported): New check for load groups of
> > >         length 3.
> > >         (vect_permute_load_chain): New permutations for load groups of
> > >         length 3.
> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
> > >         vec_perm_shuffle for the new permutations.
> > >         (vect_model_load_cost): Ditto.
> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Add
> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
> > >         * config/arm/arm.c: Ditto.
> > >         * config/rs6000/rs6000.c: Ditto.
> > >         * config/spu/spu.c: Ditto.
> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Flag for slow
> > >         byte shuffles on some x86 architectures.
> > >         * config/i386/i386.h (processor_costs): Define pshuffb cost.
> > >         * config/i386/i386.c (processor_costs): Add pshuffb cost.
> > >         (ix86_builtin_vectorization_cost): Add cost for the new
> > >         permutations.  Fix cost for other permutations.
> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
> > >         slow (TARGET_SLOW_PHUFFB).
> > >         (ix86_add_stmt_cost): Add cost when STMT is WIDEN_MULTIPLY.
> > >         Add new shuffle cost only when byte shuffle is expected.
> > >         Fix cost model for Silvermont.
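The store side (vect_permute_store_chain) needs the inverse operation: interleaving R, G and B vectors back into the r0 g0 b0 r1 ... layout before a contiguous store. The sketch below again uses a hypothetical `vec_perm` helper modelling VEC_PERM_EXPR semantics on 4-lane int vectors, with two permutations per output vector; the masks are hand-derived for this layout and are not the patch's.

```c
#include <assert.h>

/* 4 x int models one 128-bit vector.  */
#define NLANES 4

typedef struct { int v[NLANES]; } vec4;

/* Hypothetical helper with VEC_PERM_EXPR semantics: lane i of the result
   is lane SEL[i] of the concatenation of A and B.  */
static vec4
vec_perm (vec4 a, vec4 b, const int sel[NLANES])
{
  vec4 r;
  for (int i = 0; i < NLANES; i++)
    {
      assert (sel[i] >= 0 && sel[i] < 2 * NLANES);
      r.v[i] = sel[i] < NLANES ? a.v[sel[i]] : b.v[sel[i] - NLANES];
    }
  return r;
}

/* Interleave R, G and B into r0 g0 b0 r1 | g1 b1 r2 g2 | b2 r3 g3 b3 so
   the three results can be stored contiguously.  */
static void
interleave3 (vec4 r, vec4 g, vec4 b, vec4 out[3])
{
  /* Step 1 picks lanes from R (0..3) and G (4..7); don't-care lanes use
     index 0.  Step 2 keeps those lanes and inserts lanes of B (4..7).  */
  static const int rg[3][NLANES] = {
    { 0, 4, 0, 1 }, { 5, 0, 2, 6 }, { 0, 3, 7, 0 }
  };
  static const int add_b[3][NLANES] = {
    { 0, 1, 4, 3 }, { 0, 5, 2, 3 }, { 6, 1, 2, 7 }
  };

  for (int j = 0; j < 3; j++)
    out[j] = vec_perm (vec_perm (r, g, rg[j]), b, add_b[j]);
}
```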
> > >
> > > Thanks,
> > > Evgeny
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE / SUSE Labs
> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
> 
