67323 – Use non-unit stride loads by preference when applicable

Bug 67323 - Use non-unit stride loads by preference when applicable

Summary: Use non-unit stride loads by preference when applicable

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	6.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:	66721 68707
Blocks:	vectorizer
	Show dependency tree / graph

Reported:	2015-08-23 03:14 UTC by Michael Collison
Modified:	2021-05-04 12:32 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-08-25 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Michael Collison 2015-08-23 03:14:05 UTC

On arm targets the following code fails to generate a vld3:

struct pixel {
  char r,g,b;
};

void 
t2(int len, struct pixel * __restrict p, struct pixel * __restrict x)
{
  len = len & ~31;
  for (int i = 0; i < len; i++){
      p[i].r = x[i].r * 2;
      p[i].g = x[i].g * 3;
      p[i].b = x[i].b * 4;
  }
}

Yes the same code with line 11 changed to:

p[i].g = x[i].g;

does generate a vld3.

Comment 1 Richard Biener 2015-08-25 09:14:08 UTC

Confirmed.  We go down the SLP path here because the vectorizer thinks that
SLP is always cheaper than using interleaving (which generally is true
if there were not targets which can do the load plus interleave with
load-lanes ...).

I think this may be a regression as well because I enhanced SLP to apply
to way more cases.

Note that my plan is to make the vectorizer consider both (well, not really,
but this bug shows I maybe should try), SLP and non-SLP, and evaluate based
on costs which route to go.

Comment 2 Michael Collison 2015-08-25 09:57:51 UTC

Richard,

Should I create a test case that fails until you resolve this in GCC 6?

On 08/25/2015 02:14 AM, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
>
> Richard Biener <rguenth at gcc dot gnu.org> changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>               Status|UNCONFIRMED                 |ASSIGNED
>     Last reconfirmed|                            |2015-08-25
>                   CC|richard.guenther at gmail dot com  |rguenth at gcc dot gnu.org
>           Depends on|                            |66721
>             Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
>       Ever confirmed|0                           |1
>
> --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Confirmed.  We go down the SLP path here because the vectorizer thinks that
> SLP is always cheaper than using interleaving (which generally is true
> if there were not targets which can do the load plus interleave with
> load-lanes ...).
>
> I think this may be a regression as well because I enhanced SLP to apply
> to way more cases.
>
> Note that my plan is to make the vectorizer consider both (well, not really,
> but this bug shows I maybe should try), SLP and non-SLP, and evaluate based
> on costs which route to go.
>
>
> Referenced Bugs:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
> [Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs

Comment 3 rguenther@suse.de 2015-08-25 10:05:45 UTC

On Tue, 25 Aug 2015, michael.collison at linaro dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
> 
> --- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
> Richard,
> 
> Should I create a test case that fails until you resolve this in GCC 6?

If you can provide one that I can check in together with a fix that
would be nice.  Having it in the tree now and FAILing isn't according
to our policies.

> On 08/25/2015 02:14 AM, rguenth at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
> >
> > Richard Biener <rguenth at gcc dot gnu.org> changed:
> >
> >             What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >               Status|UNCONFIRMED                 |ASSIGNED
> >     Last reconfirmed|                            |2015-08-25
> >                   CC|richard.guenther at gmail dot com  |rguenth at gcc dot gnu.org
> >           Depends on|                            |66721
> >             Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
> >       Ever confirmed|0                           |1
> >
> > --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > Confirmed.  We go down the SLP path here because the vectorizer thinks that
> > SLP is always cheaper than using interleaving (which generally is true
> > if there were not targets which can do the load plus interleave with
> > load-lanes ...).
> >
> > I think this may be a regression as well because I enhanced SLP to apply
> > to way more cases.
> >
> > Note that my plan is to make the vectorizer consider both (well, not really,
> > but this bug shows I maybe should try), SLP and non-SLP, and evaluate based
> > on costs which route to go.
> >
> >
> > Referenced Bugs:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
> > [Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs
> 
>

Comment 4 Michael Collison 2015-08-25 10:14:53 UTC

Hi Richard,

No I do not have a fix now. Thanks for the info on the policy.

On 08/25/2015 03:05 AM, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
>
> --- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Tue, 25 Aug 2015, michael.collison at linaro dot org wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
>>
>> --- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
>> Richard,
>>
>> Should I create a test case that fails until you resolve this in GCC 6?
> If you can provide one that I can check in together with a fix that
> would be nice.  Having it in the tree now and FAILing isn't according
> to our policies.
>
>> On 08/25/2015 02:14 AM, rguenth at gcc dot gnu.org wrote:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
>>>
>>> Richard Biener <rguenth at gcc dot gnu.org> changed:
>>>
>>>              What    |Removed                     |Added
>>> ----------------------------------------------------------------------------
>>>                Status|UNCONFIRMED                 |ASSIGNED
>>>      Last reconfirmed|                            |2015-08-25
>>>                    CC|richard.guenther at gmail dot com  |rguenth at gcc dot gnu.org
>>>            Depends on|                            |66721
>>>              Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
>>>        Ever confirmed|0                           |1
>>>
>>> --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
>>> Confirmed.  We go down the SLP path here because the vectorizer thinks that
>>> SLP is always cheaper than using interleaving (which generally is true
>>> if there were not targets which can do the load plus interleave with
>>> load-lanes ...).
>>>
>>> I think this may be a regression as well because I enhanced SLP to apply
>>> to way more cases.
>>>
>>> Note that my plan is to make the vectorizer consider both (well, not really,
>>> but this bug shows I maybe should try), SLP and non-SLP, and evaluate based
>>> on costs which route to go.
>>>
>>>
>>> Referenced Bugs:
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
>>> [Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs
>>

Comment 5 Richard Biener 2015-10-07 11:37:31 UTC

I note that the efficiency you gain is only by a reduced number of loads/store
instructions.  vld3 instead of six vldr (huh, appearantly vld3 can load 16
byte vectors but vldr only 8 byte ones?).  I assume vld3 has no penalty
for the lane-split itself so the code-size reduction is always wanted.
Thus we'd want to always use a lane load/store even if the permutation is
pointless as soon as we'd otherwise would issue more than one SLP load, say for

void
t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i+=8) {
      p[i] = q[i] * 2;
      p[i+1] = q[i+1] * 2;
      p[i+2] = q[i+2] * 2;
      p[i+3] = q[i+3] * 2;
      p[i+4] = q[i+4] * 2;
      p[i+5] = q[i+5] * 2;
      p[i+6] = q[i+6] * 2;
      p[i+7] = q[i+7] * 2;
  }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?  Generally for SLP the implicit permute performed
by those instructions could be modeled properly (and the SLP chain
permuted accordingly).

Comment 6 Richard Biener 2015-12-14 15:33:52 UTC

Author: rguenth
Date: Mon Dec 14 15:33:20 2015
New Revision: 231620

URL: https://gcc.gnu.org/viewcvs?rev=231620&root=gcc&view=rev
Log:
2015-12-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/68707
	PR tree-optimization/67323
	* tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP instances
	if they can be vectorized using load/store-lane instructions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-slp.c

Comment 7 Richard Biener 2016-01-12 14:24:59 UTC

Should be fixed now.

Comment 8 Michael Collison 2016-01-14 05:02:01 UTC

Hi Richard,

I tried this with trunk and was unable to generate the vld3. What vectorizer options did you use?

Comment 9 Andrew Pinski 2016-01-14 05:12:43 UTC

Note the question comes here is which is better using ldr/str followed by a few mult or ld3/st3 followed by a few shifts/adds.  I think it depends on the micro-arch really (at least for aarch32).  In fact I think ldr/str followed by a few mult is much better for ThunderX and most likely also Cortex-A57 (at least that is how I read the optimizing manual).

Comment 10 Richard Biener 2016-01-14 08:48:17 UTC

(In reply to Michael Collison from comment #8)
> Hi Richard,
> 
> I tried this with trunk and was unable to generate the vld3. What vectorizer
> options did you use?

Ah, I just assumed it was fixed because the patch for PR68707 was checked in.
But that conditions the "fix" on the SLP needing permutations which doesn't
trigger here.

Let's re-open then.

As asked in that other PR the question is if vld3/std3 is really cheaper
(it's definitely smaller code).

Comment 11 Michael Collison 2016-01-14 18:59:12 UTC

Andrew,

It may be the case that is not a win on all microarchitectures however I think we should allow the vectorizer to (optionally) generate the vld3 and deal with the differences via the cost models.