Bug 115640 - [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
Summary: [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 15.0
: P3 normal
Target Milestone: 15.0
Assignee: Richard Biener
URL:
Keywords: testsuite-fail
Depends on:
Blocks:
 
Reported: 2024-06-25 11:39 UTC by Thomas Schwinge
Modified: 2024-06-28 11:45 UTC (History)
4 users (show)

See Also:
Host:
Target: GCN
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-06-25 00:00:00


Attachments
patch (502 bytes, patch)
2024-06-26 11:06 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas Schwinge 2024-06-25 11:39:49 UTC
The new PR115528 test case 'gfortran.dg/vect/pr115528.f' FAILs its execution test for GCN target (tested '-march=gfx908'):

    +PASS: gfortran.dg/vect/pr115528.f   -O  (test for excess errors)
    +FAIL: gfortran.dg/vect/pr115528.f   -O  execution test

    spawn -ignore SIGHUP [...]/build-gcc/gcc/gcn-run ./pr115528.exe
    Memory access fault by GPU node-2 (Agent handle: 0x1834c30) on address 0x7f67e6dff000. Reason: Page not present or supervisor privilege.
    FAIL: gfortran.dg/vect/pr115528.f   -O  execution test
Comment 1 Thomas Schwinge 2024-06-25 11:56:03 UTC
For GCN target (tested '-march=gfx908'), we've got identical code ('pr115528.exe') before vs. after the commit r15-1582-g2f83ea87ee328d337f87d4430861221be9babe1e "tree-optimization/115528 - fix vect alignment analysis for outer loop vect" code changes, so it's likely an unrelated issue?
Comment 2 Richard Biener 2024-06-25 12:30:20 UTC
It doesn't look like a misaligned access at least.  Instead it looks like an out-of-bound access?

OTOH I probably missed the fortran equivalent of #include "tree-vect.h" and check_vect() in main.

We do vectorize the outer loop, but with partial vectors (huh, it iterates
exactly 4 times).  I'm not exactly sure the outer loop mask is the correct
one to use in the inner loop though (but we do that).  And this might explain
the issue - we're definitely accessing excess elements of AA in the inner
loop.

In particular we disallowed grouped accesses in inner loops but this
load is basically treated as a grouped access by means of having a load
permutation.  Test coverage seems to be weak here and this restriction
should be lifted ideally (I remember issues with the IV update, but that
only was for the multiple-types case which is rejected separately).

I'm not sure how one would deal with an outer .WHILE_ULT loop mask in the
inner loop.  This is AA(I,J) with I evolving in the outer loop and J
in the inner loop.  The outer loop evolution is contiguous, the inner
loop evolution applies a stride of four.

If you force GCN to use fixed length vectors (how?), does it work?  How's
it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
enable SVE)

Confirmed as in, it looks like wrong generated code.
Comment 3 Andrew Stubbs 2024-06-25 12:36:32 UTC
(In reply to Richard Biener from comment #2)
> If you force GCN to use fixed length vectors (how?), does it work?  How's
> it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
> enable SVE)

I believe "--param vect-partial-vector-usage=0" will disable the use of WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked the value is inherited from the host in the offload toolchain; the default for x86_64 was "1", meaning approximately "only use partial vectors in epilogue loops".
Comment 4 rguenther@suse.de 2024-06-25 13:08:31 UTC
On Tue, 25 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #3 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #2)
> > If you force GCN to use fixed length vectors (how?), does it work?  How's
> > it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
> > enable SVE)
> 
> I believe "--param vect-partial-vector-usage=0" will disable the use of
> WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked
> the value is inherited from the host in the offload toolchain; the default for
> x86_64 was "1", meaning approximately "only use partial vectors in epilogue
> loops".

For unknown reasons x86-64 refuses to use partial vectors.
Comment 5 Thomas Schwinge 2024-06-25 17:55:25 UTC
Turns out, this is a regression after all: before commit r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2 "tree-optimization/114107 - avoid peeling for gaps in more cases", this test case 'gfortran.dg/vect/pr115528.f' (which, of course, wasn't in-tree back then) does PASS its execution test for GCN target (tested '-march=gfx908').
Comment 6 rguenther@suse.de 2024-06-26 06:38:30 UTC
On Tue, 25 Jun 2024, tschwinge at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|GCN: FAIL:                  |[15 Regression] GCN: FAIL:
>                    |gfortran.dg/vect/pr115528.f |gfortran.dg/vect/pr115528.f
>                    |  -O  execution test        |  -O  execution test
>            See Also|                            |https://gcc.gnu.org/bugzill
>                    |                            |a/show_bug.cgi?id=114107
> 
> --- Comment #5 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
> Turns out, this is a regression after all: before commit
> r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2 "tree-optimization/114107 -
> avoid peeling for gaps in more cases", this test case
> 'gfortran.dg/vect/pr115528.f' (which, of course, wasn't in-tree back then) does
> PASS its execution test for GCN target (tested '-march=gfx908').

Yes, the testcase outer loop wasn't vectorized before due to the
"gap" access in the inner loop.  With this rev. we can now vectorize
the inner loop without touching the gap.  Note I'm no longer sure
the loop mask use is what's wrong.  Instead it looks like the IV for
'aa' is off (just like I remember what happens when there is a grouped
access in the inner loop):

  # vectp_aa.21_73 = PHI <vectp_aa.21_72(8), vectp_aa.21_75(3)>
...
  vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77);
  vectp_aa.21_70 = vectp_aa.21_73 + 18446744073709551360; // -256
...
  vectp_aa.21_72 = vectp_aa.21_70 + 32;

for the inner loop - but the inner loop should advance by 32.

So I think the issue is a latent one, but maybe one that couldn't be
triggered.  _Possibly_ it could be triggered with V2mode vectors.

For an inner loop with a grouped access this case is still disqualified
but IIRC it has the same issue.
Comment 7 Richard Biener 2024-06-26 07:40:19 UTC
I will have a look (and for run validation try to reproduce with gfx1036).
Comment 8 Richard Biener 2024-06-26 11:05:56 UTC
(In reply to Richard Biener from comment #7)
> I will have a look (and for run validation try to reproduce with gfx1036).

OK, so with gfx1036 we end up using 16 byte vectors and the testcase
passes.  The difference with gfx908 is

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:   ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:   vect_model_load_cost: aligned.
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .

vs.

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:   ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 }
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed:   unsupported load permutation
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:   removing SLP instance operations starting from: REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed:  unsupported SLP instances
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note:  re-trying with SLP disabled

so gfx1036 cannot do such permutes but gfx908 can?

On aarch64 with SVE we are using non-SLP and we're doing load-lanes in the
outer loop.  The reason seems to be also the unsupported load permutation,
but that's possibly because of VLA vectors - GCN uses fixed size but
loop masking.  So the better equivalent would have been x86-64 with loop
masking.

So looking again I think the loop mask in the inner loop is wrong.  We have

      do i = 1,4
         do j = 1,4
            HADCUR(I)=
     $         HADCUR(I)+CMPLX(COEF1)*FORM1*AA(I,J)
         end do
      end do

and the vectorizer sees

  <bb 3> [local count: 214748368]:
  # i_35 = PHI <i_27(7), 1(2)>
  # ivtmp_82 = PHI <ivtmp_81(7), 4(2)>
  _1 = (integer(kind=8)) i_35;
  _2 = _1 + -1;
  hadcur__I_RE_lsm.15_8 = REALPART_EXPR <(*hadcur_24(D))[_2]>;
  hadcur__I_IM_lsm.16_9 = IMAGPART_EXPR <(*hadcur_24(D))[_2]>;

  <bb 4> [local count: 858993456]:
  # j_36 = PHI <j_26(8), 1(3)>
...
  _10 = (integer(kind=8)) j_36;
  _11 = _10 * 4;
  _12 = _1 + _11;
  _13 = _12 + -5;
  _14 = aa[_13];
...
  j_26 = j_36 + 1;

  <bb 5> [local count: 214748368]:
  # _86 = PHI <_49(4)>
  # _85 = PHI <_50(4)>
  REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
  IMAGPART_EXPR <(*hadcur_24(D))[_2]> = _85;
  i_27 = i_35 + 1;

the loop mask { -1, -1, -1, -1, -1, -1, -1, -1, 0, .... } is OK for
the outer loop grouped load

  vect_hadcur__I_RE_lsm.20_76 = .MASK_LOAD (vectp_hadcur.18_79, 64B, loop_mask_77);

but for the inner loop we do

  vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77);

with the same mask.  This fails to be pruned for the GAP which means
that my improving of gap handling relies for this case to not end up
in the masked load handling.  In fact get_group_load_store_type doesn't
seem to be prepared for outer loop vectorization.  OTOH the inner loop
isn't "unrolled" (it has a VF of 1), and this might be a mistake of
loop mask handling and bad re-use.

As was said elsewhere outer loop vectorization with inner loop datarefs
is compensating for a missed interchange.
Comment 9 Richard Biener 2024-06-26 11:06:28 UTC
Created attachment 58519 [details]
patch

I think this fixes it, but I cannot validate.
Comment 10 Andrew Stubbs 2024-06-26 11:30:50 UTC
On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #7)
>> I will have a look (and for run validation try to reproduce with gfx1036).
> 
> OK, so with gfx1036 we end up using 16 byte vectors and the testcase
> passes.  The difference with gfx908 is
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: aligned.
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .
> 
> vs.
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10
> 10 11 11 12 12 13 13 14 14 15 15 }
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported load permutation
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72:
> missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   removing SLP instance operations starting from: REALPART_EXPR
> <(*hadcur_24(D))[_2]> = _86;
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:  unsupported SLP instances
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:  re-trying with SLP disabled
> 
> so gfx1036 cannot do such permutes but gfx908 can?

GFX10 has more limited permutation capabilities than GFX9 because it 
only has 32-lane vectors natively, even though we're using the 64-lane 
"compatibility" mode.

However, in theory, the permutation capabilities on V32 and below should 
be the same, and some permutations on V64 are allowed, so I don't know 
why it doesn't use it. It's possible I broke the logic in 
gcn_vectorize_vec_perm_const:

   /* RDNA devices can only do permutations within each group of 32-lanes.
      Reject permutations that cross the boundary.  */
   if (TARGET_RDNA2_PLUS)
     for (unsigned int i = 0; i < nelt; i++)
       if (i < 31 ? perm[i] > 31 : perm[i] < 32)
         return false;

It looks right to me though?

The vec_extract patterns that also use permutations are likewise 
supposedly still enabled for V32 and below.

Andrew
Comment 11 Thomas Schwinge 2024-06-26 11:54:24 UTC
(In reply to Richard Biener from comment #9)
> Created attachment 58519 [details]
> patch
> 
> I think this fixes it, but I cannot validate.

Yes, it does, thanks!
Comment 12 Richard Biener 2024-06-26 12:26:32 UTC
(In reply to Andrew Stubbs from comment #10)
> On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> > 
> > --- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #7)
> >> I will have a look (and for run validation try to reproduce with gfx1036).
> > 
> > OK, so with gfx1036 we end up using 16 byte vectors and the testcase
> > passes.  The difference with gfx908 is
> > 
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:   ==> examining statement: _14 = aa[_13];
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:   vect_model_load_cost: aligned.
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .
> > 
> > vs.
> > 
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:   ==> examining statement: _14 = aa[_13];
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10
> > 10 11 11 12 12 13 13 14 14 15 15 }
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > missed:   unsupported load permutation
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72:
> > missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:   removing SLP instance operations starting from: REALPART_EXPR
> > <(*hadcur_24(D))[_2]> = _86;
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > missed:  unsupported SLP instances
> > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> > note:  re-trying with SLP disabled
> > 
> > so gfx1036 cannot do such permutes but gfx908 can?
> 
> GFX10 has more limited permutation capabilities than GFX9 because it 
> only has 32-lane vectors natively, even though we're using the 64-lane 
> "compatibility" mode.
> 
> However, in theory, the permutation capabilities on V32 and below should 
> be the same, and some permutations on V64 are allowed, so I don't know 
> why it doesn't use it. It's possible I broke the logic in 
> gcn_vectorize_vec_perm_const:
> 
>    /* RDNA devices can only do permutations within each group of 32-lanes.
>       Reject permutations that cross the boundary.  */
>    if (TARGET_RDNA2_PLUS)
>      for (unsigned int i = 0; i < nelt; i++)
>        if (i < 31 ? perm[i] > 31 : perm[i] < 32)
>          return false;
> 
> It looks right to me though?

nelt == 32 so I think the last element has the wrong check applied?

It should be

>        if (i < 32 ? perm[i] > 31 : perm[i] < 32)

I think.  With that the vectorization happens in a similar way but the
failure still doesn't reproduce (without the patch, of course).

> The vec_extract patterns that also use permutations are likewise 
> supposedly still enabled for V32 and below.
> 
> Andrew
Comment 13 Richard Biener 2024-06-26 12:34:01 UTC
(In reply to Richard Biener from comment #12)
> (In reply to Andrew Stubbs from comment #10)
> > GFX10 has more limited permutation capabilities than GFX9 because it 
> > only has 32-lane vectors natively, even though we're using the 64-lane 
> > "compatibility" mode.
> > 
> > However, in theory, the permutation capabilities on V32 and below should 
> > be the same, and some permutations on V64 are allowed, so I don't know 
> > why it doesn't use it. It's possible I broke the logic in 
> > gcn_vectorize_vec_perm_const:
> > 
> >    /* RDNA devices can only do permutations within each group of 32-lanes.
> >       Reject permutations that cross the boundary.  */
> >    if (TARGET_RDNA2_PLUS)
> >      for (unsigned int i = 0; i < nelt; i++)
> >        if (i < 31 ? perm[i] > 31 : perm[i] < 32)
> >          return false;
> > 
> > It looks right to me though?
> 
> nelt == 32 so I think the last element has the wrong check applied?
> 
> It should be
> 
> >        if (i < 32 ? perm[i] > 31 : perm[i] < 32)
> 
> I think.  With that the vectorization happens in a similar way but the
> failure still doesn't reproduce (without the patch, of course).

Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
two vectors src0 and src1 into one 32 element dst vector (it's no longer
required that src0 and src1 line up with the dst vector size btw, they
might have different nelt).  So the loop would reject interleaving
the low parts of two 32 element vectors, a permute that would look like
{ 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
mean you can never mix the two vector inputs?  Or does GCN not have
a two-to-one vector permute instruction?
Comment 14 Andrew Stubbs 2024-06-26 13:34:30 UTC
On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #12)
>> (In reply to Andrew Stubbs from comment #10)
>>> GFX10 has more limited permutation capabilities than GFX9 because it
>>> only has 32-lane vectors natively, even though we're using the 64-lane
>>> "compatibility" mode.
>>>
>>> However, in theory, the permutation capabilities on V32 and below should
>>> be the same, and some permutations on V64 are allowed, so I don't know
>>> why it doesn't use it. It's possible I broke the logic in
>>> gcn_vectorize_vec_perm_const:
>>>
>>>     /* RDNA devices can only do permutations within each group of 32-lanes.
>>>        Reject permutations that cross the boundary.  */
>>>     if (TARGET_RDNA2_PLUS)
>>>       for (unsigned int i = 0; i < nelt; i++)
>>>         if (i < 31 ? perm[i] > 31 : perm[i] < 32)
>>>           return false;
>>>
>>> It looks right to me though?
>>
>> nelt == 32 so I think the last element has the wrong check applied?
>>
>> It should be
>>
>>>         if (i < 32 ? perm[i] > 31 : perm[i] < 32)
>>
>> I think.  With that the vectorization happens in a similar way but the
>> failure still doesn't reproduce (without the patch, of course).

Oops, I think you're right.

> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> two vectors src0 and src1 into one 32 element dst vector (it's no longer
> required that src0 and src1 line up with the dst vector size btw, they
> might have different nelt).  So the loop would reject interleaving
> the low parts of two 32 element vectors, a permute that would look like
> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> mean you can never mix the two vector inputs?  Or does GCN not have
> a two-to-one vector permute instruction?

GCN does not have two-to-one vector permute in hardware, so we do two 
permutes and a vec_merge to get the same effect.

GFX9 can permute all the elements within a 64 lane vector arbitrarily.

GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but 
no value may cross the boundary. AFAIK there's no way to do that via any 
vector instruction (i.e. without writing to memory, or extracting values 
element-wise).

In theory, we could implement permutes with different sized inputs and 
outputs, but right now those are rejected early. The interleave example 
wouldn't work in hardware, for GFX10, but we could have it for GFX9.

However, I think you might be right about the numbering of the "perm" 
array; we probably need to be testing "(perm[i] % nelt) > 31" if we are 
to support two-to-one permutations.

Thanks for looking at this.

Andrew
Comment 15 rguenther@suse.de 2024-06-26 13:41:16 UTC
On Wed, 26 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #14 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> > 
> > --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #12)
> >> (In reply to Andrew Stubbs from comment #10)
> >>> GFX10 has more limited permutation capabilities than GFX9 because it
> >>> only has 32-lane vectors natively, even though we're using the 64-lane
> >>> "compatibility" mode.
> >>>
> >>> However, in theory, the permutation capabilities on V32 and below should
> >>> be the same, and some permutations on V64 are allowed, so I don't know
> >>> why it doesn't use it. It's possible I broke the logic in
> >>> gcn_vectorize_vec_perm_const:
> >>>
> >>>     /* RDNA devices can only do permutations within each group of 32-lanes.
> >>>        Reject permutations that cross the boundary.  */
> >>>     if (TARGET_RDNA2_PLUS)
> >>>       for (unsigned int i = 0; i < nelt; i++)
> >>>         if (i < 31 ? perm[i] > 31 : perm[i] < 32)
> >>>           return false;
> >>>
> >>> It looks right to me though?
> >>
> >> nelt == 32 so I think the last element has the wrong check applied?
> >>
> >> It should be
> >>
> >>>         if (i < 32 ? perm[i] > 31 : perm[i] < 32)
> >>
> >> I think.  With that the vectorization happens in a similar way but the
> >> failure still doesn't reproduce (without the patch, of course).
> 
> Oops, I think you're right.
> 
> > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> > two vectors src0 and src1 into one 32 element dst vector (it's no longer
> > required that src0 and src1 line up with the dst vector size btw, they
> > might have different nelt).  So the loop would reject interleaving
> > the low parts of two 32 element vectors, a permute that would look like
> > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> > mean you can never mix the two vector inputs?  Or does GCN not have
> > a two-to-one vector permute instruction?
> 
> GCN does not have two-to-one vector permute in hardware, so we do two 
> permutes and a vec_merge to get the same effect.
> 
> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
> 
> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but 
> no value may cross the boundary. AFAIK there's no way to do that via any 
> vector instruction (i.e. without writing to memory, or extracting values 
> element-wise).

I see - so it cannot even swap low-32 and high-32?  I'm thinking of
what sub-part of permutes would be possible by extending the two-to-one
vec_merge trick.

OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
restriction should be fine.
Comment 16 Andrew Stubbs 2024-06-26 14:05:10 UTC
On 26/06/2024 14:41, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
>>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
>>> two vectors src0 and src1 into one 32 element dst vector (it's no longer
>>> required that src0 and src1 line up with the dst vector size btw, they
>>> might have different nelt).  So the loop would reject interleaving
>>> the low parts of two 32 element vectors, a permute that would look like
>>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
>>> mean you can never mix the two vector inputs?  Or does GCN not have
>>> a two-to-one vector permute instruction?
>>
>> GCN does not have two-to-one vector permute in hardware, so we do two
>> permutes and a vec_merge to get the same effect.
>>
>> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>>
>> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
>> no value may cross the boundary. AFAIK there's no way to do that via any
>> vector instruction (i.e. without writing to memory, or extracting values
>> element-wise).
> 
> I see - so it cannot even swap low-32 and high-32?  I'm thinking of
> what sub-part of permutes would be possible by extending the two-to-one
> vec_merge trick.

No(?)

The 64-lane compatibility mode works, under the hood, by allocating 
double the number of 32-lane registers and then executing each 
instruction twice. Mostly this is invisible, but it gets exposed for 
permutations and the like. Logically, the microarchitecture could do a 
vec_merge to DTRT, but I've not found a way to express that.

It's possible I missed something when RTFM.

> OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
> restriction should be fine.

Yes, with the "31" fixed it should work.

Andrew
Comment 17 GCC Commits 2024-06-28 11:08:56 UTC
The master branch has been updated by Andrew Stubbs <ams@gcc.gnu.org>:

https://gcc.gnu.org/g:ef0b30212f7756db15d7507bfd871bf377d7d648

commit r15-1705-gef0b30212f7756db15d7507bfd871bf377d7d648
Author: Andrew Stubbs <ams@baylibre.com>
Date:   Fri Jun 28 10:47:50 2024 +0000

    amdgcn: Fix RDNA V32 permutations [PR115640]
    
    There was an off-by-one error in the RDNA validation check, plus I forgot to
    allow for two-to-one permute-and-merge operations.
    
            PR target/115640
    
    gcc/ChangeLog:
    
            * config/gcn/gcn.cc (gcn_vectorize_vec_perm_const): Modify RDNA checks.
Comment 18 Andrew Stubbs 2024-06-28 11:12:02 UTC
That should fix the broken validation check. All V32 permutations should work now on RDNA GPUs, I think. V16 and smaller were already working fine.
Comment 19 GCC Commits 2024-06-28 11:44:48 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:0192341a07b8ea30f631cf4afdc6fcf3fa7ce838

commit r15-1706-g0192341a07b8ea30f631cf4afdc6fcf3fa7ce838
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jun 26 14:07:51 2024 +0200

    tree-optimization/115640 - outer loop vect with inner SLP permute
    
    The following fixes wrong-code when using outer loop vectorization
    and an inner loop SLP access with permutation.  A wrong adjustment
    to the IV increment is then applied on GCN.
    
            PR tree-optimization/115640
            * tree-vect-stmts.cc (vectorizable_load): With an inner
            loop SLP access to not apply a gap adjustment.
Comment 20 Richard Biener 2024-06-28 11:45:28 UTC
The wrong-code issue should be fixed now.