On ARM targets the following code fails to generate a vld3:

struct pixel {
  char r,g,b;
};

void
t2(int len, struct pixel * __restrict p, struct pixel * __restrict x)
{
  len = len & ~31;
  for (int i = 0; i < len; i++){
    p[i].r = x[i].r * 2;
    p[i].g = x[i].g * 3;
    p[i].b = x[i].b * 4;
  }
}

Yet the same code with line 11 (the p[i].g assignment) changed to:

    p[i].g = x[i].g;

does generate a vld3.
Confirmed. We go down the SLP path here because the vectorizer thinks that SLP is always cheaper than using interleaving (which would generally be true if there were no targets that can do the load plus interleave with load-lanes ...).

I think this may be a regression as well because I enhanced SLP to apply to way more cases.

Note that my plan is to make the vectorizer consider both (well, not really, but this bug shows I maybe should try), SLP and non-SLP, and evaluate based on costs which route to go.
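To make the two routes concrete, here is a minimal scalar sketch of how each one views the pixel loop. It is only a model in portable C, not what the vectorizer emits. The function names, the LANES constant and the self-check helper are hypothetical, and the struct uses unsigned char rather than the report's plain char so that the narrowing store is well defined.

```c
#include <assert.h>
#include <string.h>

/* Assumption: unsigned char instead of the report's plain char, so the
   narrowing store is well defined in this sketch. */
struct pixel { unsigned char r, g, b; };
_Static_assert(sizeof(struct pixel) == 3, "sketch assumes a 3-byte pixel");

enum { LANES = 8 };   /* bytes in one 64-bit d register */

/* Scalar reference: the loop from the report. */
static void t2_scalar(int len, struct pixel *restrict p,
                      const struct pixel *restrict x)
{
    for (int i = 0; i < len; i++) {
        p[i].r = (unsigned char)(x[i].r * 2);
        p[i].g = (unsigned char)(x[i].g * 3);
        p[i].b = (unsigned char)(x[i].b * 4);
    }
}

/* SLP view: memory is read contiguously (r,g,b,r,g,b,...) and each byte
   lane gets its own multiplier, repeating with period 3. */
static void t2_slp_view(int len, struct pixel *restrict p,
                        const struct pixel *restrict x)
{
    static const unsigned char mul[3] = { 2, 3, 4 };
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)p;
    for (int j = 0; j < 3 * len; j++)
        dst[j] = (unsigned char)(src[j] * mul[j % 3]);
}

/* Load-lanes view: de-interleave into one register per channel, apply a
   uniform multiplier to each register, then interleave on the store. */
static void t2_lanes_view(int len, struct pixel *restrict p,
                          const struct pixel *restrict x)
{
    assert(len % LANES == 0);   /* the report masks len to a multiple of 32 */
    for (int i = 0; i < len; i += LANES) {
        unsigned char r[LANES], g[LANES], b[LANES];
        for (int k = 0; k < LANES; k++) {   /* models the effect of vld3.8 */
            r[k] = x[i + k].r;
            g[k] = x[i + k].g;
            b[k] = x[i + k].b;
        }
        for (int k = 0; k < LANES; k++) {   /* one constant per register */
            r[k] = (unsigned char)(r[k] * 2);
            g[k] = (unsigned char)(g[k] * 3);
            b[k] = (unsigned char)(b[k] * 4);
        }
        for (int k = 0; k < LANES; k++) {   /* models the effect of vst3.8 */
            p[i + k].r = r[k];
            p[i + k].g = g[k];
            p[i + k].b = b[k];
        }
    }
}

/* Hypothetical self-check: all three versions must produce the same bytes. */
static int t2_views_agree(void)
{
    struct pixel in[32], a[32], b[32], c[32];
    for (int i = 0; i < 32; i++) {
        in[i].r = (unsigned char)i;
        in[i].g = (unsigned char)(i + 40);
        in[i].b = (unsigned char)(i + 80);
    }
    t2_scalar(32, a, in);
    t2_slp_view(32, b, in);
    t2_lanes_view(32, c, in);
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```

Both views compute the same result; they differ only in which loads, stores and constants the target has to issue, which is what a cost comparison would have to weigh.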
Richard,

Should I create a test case that fails until you resolve this in GCC 6?

On 08/25/2015 02:14 AM, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
>
> Richard Biener <rguenth at gcc dot gnu.org> changed:
>
>            What    |Removed                           |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                       |ASSIGNED
>    Last reconfirmed|                                  |2015-08-25
>                  CC|richard.guenther at gmail dot com |rguenth at gcc dot gnu.org
>          Depends on|                                  |66721
>            Assignee|unassigned at gcc dot gnu.org     |rguenth at gcc dot gnu.org
>      Ever confirmed|0                                 |1
>
> Referenced Bugs:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
> [Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs
On Tue, 25 Aug 2015, michael.collison at linaro dot org wrote:
> Should I create a test case that fails until you resolve this in GCC 6?

If you can provide one that I can check in together with a fix that would be nice. Having it in the tree now and FAILing isn't according to our policies.
Hi Richard,

No, I do not have a fix now. Thanks for the info on the policy.
I note that the efficiency you gain comes only from a reduced number of load/store instructions: vld3 instead of six vldr (huh, apparently vld3 can load 16 byte vectors but vldr only 8 byte ones?). I assume vld3 has no penalty for the lane-split itself so the code-size reduction is always wanted.

Thus we'd want to always use a lane load/store even if the permutation is pointless, as soon as we would otherwise issue more than one SLP load, say for

void t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i+=8)
    {
      p[i] = q[i] * 2;
      p[i+1] = q[i+1] * 2;
      p[i+2] = q[i+2] * 2;
      p[i+3] = q[i+3] * 2;
      p[i+4] = q[i+4] * 2;
      p[i+5] = q[i+5] * 2;
      p[i+6] = q[i+6] * 2;
      p[i+7] = q[i+7] * 2;
    }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?

Generally for SLP the implicit permute performed by those instructions could be modeled properly (and the SLP chain permuted accordingly).
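As a rough illustration of why the permutation is pointless for t5, the sketch below models one group of eight ints both ways in plain C: contiguously, and via an even/odd split of the kind a vld2.32 / vst2.32 pair would perform. The function names, the VL constant and the self-check helper are hypothetical; this models the data movement only and makes no claim about the instructions GCC selects.

```c
#include <assert.h>

enum { VL = 4 };   /* 32-bit lanes in one 128-bit q register */

/* Contiguous view: two plain vector loads, shift, two plain stores. */
static void t5_contiguous(int *restrict p, const int *restrict q)
{
    for (int k = 0; k < 2 * VL; k++)
        p[k] = q[k] * 2;
}

/* Lane view: split into even and odd elements, apply the same operation
   to both registers, then interleave again on the store. */
static void t5_lane2(int *restrict p, const int *restrict q)
{
    int even[VL], odd[VL];
    for (int k = 0; k < VL; k++) {   /* models the effect of vld2.32 */
        even[k] = q[2 * k];
        odd[k]  = q[2 * k + 1];
    }
    for (int k = 0; k < VL; k++) {   /* same op in every lane (vshl #1) */
        even[k] *= 2;
        odd[k]  *= 2;
    }
    for (int k = 0; k < VL; k++) {   /* models the effect of vst2.32 */
        p[2 * k]     = even[k];
        p[2 * k + 1] = odd[k];
    }
}

/* Hypothetical self-check on one group of eight elements. */
static int t5_variants_agree(void)
{
    int in[2 * VL], a[2 * VL], b[2 * VL];
    for (int k = 0; k < 2 * VL; k++)
        in[k] = 10 * k - 7;
    t5_contiguous(a, in);
    t5_lane2(b, in);
    for (int k = 0; k < 2 * VL; k++)
        if (a[k] != b[k])
            return 0;
    return 1;
}
```

Because every lane performs the same operation, the split and the re-interleave cancel out, so the only difference left is the number of memory instructions.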
Author: rguenth
Date: Mon Dec 14 15:33:20 2015
New Revision: 231620

URL: https://gcc.gnu.org/viewcvs?rev=231620&root=gcc&view=rev
Log:
2015-12-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/68707
	PR tree-optimization/67323
	* tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP instances
	if they can be vectorized using load/store-lane instructions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-slp.c
Should be fixed now.
Hi Richard, I tried this with trunk and was unable to generate the vld3. What vectorizer options did you use?
Note that the question here is which is better: ldr/str followed by a few mults, or ld3/st3 followed by a few shifts/adds. I think it really depends on the micro-architecture (at least for aarch32). In fact I think ldr/str followed by a few mults is much better for ThunderX and most likely also for Cortex-A57 (at least that is how I read the optimization manual).
(In reply to Michael Collison from comment #8)
> Hi Richard,
>
> I tried this with trunk and was unable to generate the vld3. What vectorizer
> options did you use?

Ah, I just assumed it was fixed because the patch for PR68707 was checked in. But that conditions the "fix" on the SLP needing permutations, which doesn't trigger here.

Let's re-open then. As asked in that other PR, the question is whether vld3/vst3 is really cheaper (it's definitely smaller code).
Andrew,

It may be the case that this is not a win on all microarchitectures. However, I think we should allow the vectorizer to (optionally) generate the vld3 and deal with the differences via the cost models.
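One way to picture "optional, decided by cost" is the sketch below. It is purely illustrative: the enum, the struct, both functions and the allow_load_lanes switch are hypothetical names invented here, and none of this is GCC's actual cost-model interface. The cost figures would have to come from each target's tuning data.

```c
#include <assert.h>

/* Hypothetical names throughout; this is not GCC's cost-model API. */
enum vect_strategy { STRATEGY_SLP, STRATEGY_LOAD_LANES };

struct strategy_cost {
    unsigned mem_cost;   /* cost of the loads and stores issued */
    unsigned alu_cost;   /* cost of the arithmetic that follows them */
};

static unsigned total_cost(const struct strategy_cost *c)
{
    return c->mem_cost + c->alu_cost;
}

/* Pick load-lanes only when it is enabled and strictly cheaper; ties keep
   the existing SLP behaviour. */
static enum vect_strategy
choose_strategy(const struct strategy_cost *slp,
                const struct strategy_cost *lanes,
                int allow_load_lanes)
{
    assert(slp != 0 && lanes != 0);
    if (!allow_load_lanes)
        return STRATEGY_SLP;
    return total_cost(lanes) < total_cost(slp) ? STRATEGY_LOAD_LANES
                                               : STRATEGY_SLP;
}
```

A microarchitecture where the multiplies are cheap and lane loads are slow would simply report a higher cost for the load-lanes route and keep SLP.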