The new PR115528 test case 'gfortran.dg/vect/pr115528.f' FAILs its execution test for GCN target (tested '-march=gfx908'): +PASS: gfortran.dg/vect/pr115528.f -O (test for excess errors) +FAIL: gfortran.dg/vect/pr115528.f -O execution test spawn -ignore SIGHUP [...]/build-gcc/gcc/gcn-run ./pr115528.exe Memory access fault by GPU node-2 (Agent handle: 0x1834c30) on address 0x7f67e6dff000. Reason: Page not present or supervisor privilege. FAIL: gfortran.dg/vect/pr115528.f -O execution test
For GCN target (tested '-march=gfx908'), we've got identical code ('pr115528.exe') before vs. after the commit r15-1582-g2f83ea87ee328d337f87d4430861221be9babe1e "tree-optimization/115528 - fix vect alignment analysis for outer loop vect" code changes, so it's likely an unrelated issue?
It doesn't look like a misaligned access at least. Instead it looks like an out-of-bound access? OTOH I probably missed the fortran equivalent of #include "tree-vect.h" and check_vect() in main. We do vectorize the outer loop, but with partial vectors (huh, it iterates exactly 4 times). I'm not exactly sure the outer loop mask is the correct one to use in the inner loop though (but we do that). And this might explain the issue - we're definitely accessing excess elements of AA in the inner loop. In particular we disallowed grouped accesses in inner loops but this load is basically treated as a grouped access by means of having a load permutation. Test coverage seems to be weak here and this restriction should be lifted ideally (I remember issues with the IV update, but that only was for the multiple-types case which is rejected separately). I'm not sure how one would deal with an outer .WHILE_ULT loop mask in the inner loop. This is AA(I,J) with I evolving in the outer loop and J in the inner loop. The outer loop evolution is contiguous, the inner loop evolution applies a stride of four. If you force GCN to use fixed length vectors (how?), does it work? How's it behaving on aarch64 with SVE? (the CI was happy, but maybe doesn't enable SVE) Confirmed as in, it looks like wrong generated code.
(In reply to Richard Biener from comment #2) > If you force GCN to use fixed length vectors (how?), does it work? How's > it behaving on aarch64 with SVE? (the CI was happy, but maybe doesn't > enable SVE) I believe "--param vect-partial-vector-usage=0" will disable the use of WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked the value is inherited from the host in the offload toolchain; the default for x86_64 was "1", meaning approximately "only use partial vectors in epilogue loops".
On Tue, 25 Jun 2024, ams at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #3 from Andrew Stubbs <ams at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #2) > > If you force GCN to use fixed length vectors (how?), does it work? How's > > it behaving on aarch64 with SVE? (the CI was happy, but maybe doesn't > > enable SVE) > > I believe "--param vect-partial-vector-usage=0" will disable the use of > WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked > the value is inherited from the host in the offload toolchain; the default for > x86_64 was "1", meaning approximately "only use partial vectors in epilogue > loops". For unknown reasons x86-64 refuses to use partial vectors.
Turns out, this is a regression after all: before commit r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2 "tree-optimization/114107 - avoid peeling for gaps in more cases", this test case 'gfortran.dg/vect/pr115528.f' (which, of course, wasn't in-tree back then) does PASS its execution test for GCN target (tested '-march=gfx908').
On Tue, 25 Jun 2024, tschwinge at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > Thomas Schwinge <tschwinge at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Summary|GCN: FAIL: |[15 Regression] GCN: FAIL: > |gfortran.dg/vect/pr115528.f |gfortran.dg/vect/pr115528.f > | -O execution test | -O execution test > See Also| |https://gcc.gnu.org/bugzill > | |a/show_bug.cgi?id=114107 > > --- Comment #5 from Thomas Schwinge <tschwinge at gcc dot gnu.org> --- > Turns out, this is a regression after all: before commit > r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2 "tree-optimization/114107 - > avoid peeling for gaps in more cases", this test case > 'gfortran.dg/vect/pr115528.f' (which, of course, wasn't in-tree back then) does > PASS its execution test for GCN target (tested '-march=gfx908'). Yes, the testcase outer loop wasn't vectorized before due to the "gap" access in the inner loop. With this rev. we can now vectorize the inner loop without touching the gap. Note I'm no longer sure the loop mask use is what's wrong. Instead it looks like the IV for 'aa' is off (just like I remember what happens when there is a grouped access in the inner loop): # vectp_aa.21_73 = PHI <vectp_aa.21_72(8), vectp_aa.21_75(3)> ... vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77); vectp_aa.21_70 = vectp_aa.21_73 + 18446744073709551360; // -256 ... vectp_aa.21_72 = vectp_aa.21_70 + 32; for the inner loop - but the inner loop should advance by 32. So I think the issue is a latent one, but maybe one that couldn't be triggered. _Possibly_ it could be triggered with V2mode vectors. For an inner loop with a grouped access this case is still disqualified but IIRC it has the same issue.
I will have a look (and for run validation try to reproduce with gfx1036).
(In reply to Richard Biener from comment #7) > I will have a look (and for run validation try to reproduce with gfx1036). OK, so with gfx1036 we end up using 16 byte vectors and the testcase passes. The difference with gfx908 is /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13]; /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: aligned. /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 . vs. /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13]; /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 } /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported load permutation /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: missed: not vectorized: relevant stmt not supported: _14 = aa[_13]; /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: removing SLP instance operations starting from: REALPART_EXPR <(*hadcur_24(D))[_2]> = _86; /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported SLP instances /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: re-trying with SLP disabled so gfx1036 cannot do such permutes but gfx908 can? On aarch64 with SVE we are using non-SLP and we're doing load-lanes in the outer loop. The reason seems to be also the unsupported load permutation, but that's possibly because of VLA vectors - GCN uses fixed size but loop masking. So the better equivalent would have been x86-64 with loop masking. So looking again I think the loop mask in the inner loop is wrong. We have do i = 1,4 do j = 1,4 HADCUR(I)= $ HADCUR(I)+CMPLX(COEF1)*FORM1*AA(I,J) end do end do and the vectorizer sees <bb 3> [local count: 214748368]: # i_35 = PHI <i_27(7), 1(2)> # ivtmp_82 = PHI <ivtmp_81(7), 4(2)> _1 = (integer(kind=8)) i_35; _2 = _1 + -1; hadcur__I_RE_lsm.15_8 = REALPART_EXPR <(*hadcur_24(D))[_2]>; hadcur__I_IM_lsm.16_9 = IMAGPART_EXPR <(*hadcur_24(D))[_2]>; <bb 4> [local count: 858993456]: # j_36 = PHI <j_26(8), 1(3)> ... _10 = (integer(kind=8)) j_36; _11 = _10 * 4; _12 = _1 + _11; _13 = _12 + -5; _14 = aa[_13]; ... j_26 = j_36 + 1; <bb 5> [local count: 214748368]: # _86 = PHI <_49(4)> # _85 = PHI <_50(4)> REALPART_EXPR <(*hadcur_24(D))[_2]> = _86; IMAGPART_EXPR <(*hadcur_24(D))[_2]> = _85; i_27 = i_35 + 1; the loop mask { -1, -1, -1, -1, -1, -1, -1, -1, 0, .... } is OK for the outer loop grouped load vect_hadcur__I_RE_lsm.20_76 = .MASK_LOAD (vectp_hadcur.18_79, 64B, loop_mask_77); but for the inner loop we do vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77); with the same mask. This fails to be pruned for the GAP which means that my improving of gap handling relies for this case to not end up in the masked load handling. In fact get_group_load_store_type doesn't seem to be prepared for outer loop vectorization. OTOH the inner loop isn't "unrolled" (it has a VF of 1), and this might be a mistake of loop mask handling and bad re-use. As was said elsewhere outer loop vectorization with inner loop datarefs is compensating for a missed interchange.
Created attachment 58519 [details] patch I think this fixes it, but I cannot validate.
On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #7) >> I will have a look (and for run validation try to reproduce with gfx1036). > > OK, so with gfx1036 we end up using 16 byte vectors and the testcase > passes. The difference with gfx908 is > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: ==> examining statement: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: vect_model_load_cost: aligned. > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 . > > vs. > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: ==> examining statement: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 > 10 11 11 12 12 13 13 14 14 15 15 } > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported load permutation > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: > missed: not vectorized: relevant stmt not supported: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: removing SLP instance operations starting from: REALPART_EXPR > <(*hadcur_24(D))[_2]> = _86; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported SLP instances > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: re-trying with SLP disabled > > so gfx1036 cannot do such permutes but gfx908 can? GFX10 has more limited permutation capabilities than GFX9 because it only has 32-lane vectors natively, even though we're using the 64-lane "compatibility" mode. However, in theory, the permutation capabilities on V32 and below should be the same, and some permutations on V64 are allowed, so I don't know why it doesn't use it. It's possible I broke the logic in gcn_vectorize_vec_perm_const: /* RDNA devices can only do permutations within each group of 32-lanes. Reject permutations that cross the boundary. */ if (TARGET_RDNA2_PLUS) for (unsigned int i = 0; i < nelt; i++) if (i < 31 ? perm[i] > 31 : perm[i] < 32) return false; It looks right to me though? The vec_extract patterns that also use permutations are likewise supposedly still enabled for V32 and below. Andrew
(In reply to Richard Biener from comment #9) > Created attachment 58519 [details] > patch > > I think this fixes it, but I cannot validate. Yes, it does, thanks!
(In reply to Andrew Stubbs from comment #10) > On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > > > --- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> --- > > (In reply to Richard Biener from comment #7) > >> I will have a look (and for run validation try to reproduce with gfx1036). > > > > OK, so with gfx1036 we end up using 16 byte vectors and the testcase > > passes. The difference with gfx908 is > > > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: ==> examining statement: _14 = aa[_13]; > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: vect_model_load_cost: aligned. > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 . > > > > vs. > > > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: ==> examining statement: _14 = aa[_13]; > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 > > 10 11 11 12 12 13 13 14 14 15 15 } > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > missed: unsupported load permutation > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: > > missed: not vectorized: relevant stmt not supported: _14 = aa[_13]; > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: removing SLP instance operations starting from: REALPART_EXPR > > <(*hadcur_24(D))[_2]> = _86; > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > missed: unsupported SLP instances > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > > note: re-trying with SLP disabled > > > > so gfx1036 cannot do such permutes but gfx908 can? > > GFX10 has more limited permutation capabilities than GFX9 because it > only has 32-lane vectors natively, even though we're using the 64-lane > "compatibility" mode. > > However, in theory, the permutation capabilities on V32 and below should > be the same, and some permutations on V64 are allowed, so I don't know > why it doesn't use it. It's possible I broke the logic in > gcn_vectorize_vec_perm_const: > > /* RDNA devices can only do permutations within each group of 32-lanes. > Reject permutations that cross the boundary. */ > if (TARGET_RDNA2_PLUS) > for (unsigned int i = 0; i < nelt; i++) > if (i < 31 ? perm[i] > 31 : perm[i] < 32) > return false; > > It looks right to me though? nelt == 32 so I think the last element has the wrong check applied? It should be > if (i < 32 ? perm[i] > 31 : perm[i] < 32) I think. With that the vectorization happens in a similar way but the failure still doesn't reproduce (without the patch, of course). > The vec_extract patterns that also use permutations are likewise > supposedly still enabled for V32 and below. > > Andrew
(In reply to Richard Biener from comment #12) > (In reply to Andrew Stubbs from comment #10) > > GFX10 has more limited permutation capabilities than GFX9 because it > > only has 32-lane vectors natively, even though we're using the 64-lane > > "compatibility" mode. > > > > However, in theory, the permutation capabilities on V32 and below should > > be the same, and some permutations on V64 are allowed, so I don't know > > why it doesn't use it. It's possible I broke the logic in > > gcn_vectorize_vec_perm_const: > > > > /* RDNA devices can only do permutations within each group of 32-lanes. > > Reject permutations that cross the boundary. */ > > if (TARGET_RDNA2_PLUS) > > for (unsigned int i = 0; i < nelt; i++) > > if (i < 31 ? perm[i] > 31 : perm[i] < 32) > > return false; > > > > It looks right to me though? > > nelt == 32 so I think the last element has the wrong check applied? > > It should be > > > if (i < 32 ? perm[i] > 31 : perm[i] < 32) > > I think. With that the vectorization happens in a similar way but the > failure still doesn't reproduce (without the patch, of course). Btw, the above looks quite odd for nelt == 32 anyway - we are permuting two vectors src0 and src1 into one 32 element dst vector (it's no longer required that src0 and src1 line up with the dst vector size btw, they might have different nelt). So the loop would reject interleaving the low parts of two 32 element vectors, a permute that would look like { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" mean you can never mix the two vector inputs? Or does GCN not have a two-to-one vector permute instruction?
On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #12) >> (In reply to Andrew Stubbs from comment #10) >>> GFX10 has more limited permutation capabilities than GFX9 because it >>> only has 32-lane vectors natively, even though we're using the 64-lane >>> "compatibility" mode. >>> >>> However, in theory, the permutation capabilities on V32 and below should >>> be the same, and some permutations on V64 are allowed, so I don't know >>> why it doesn't use it. It's possible I broke the logic in >>> gcn_vectorize_vec_perm_const: >>> >>> /* RDNA devices can only do permutations within each group of 32-lanes. >>> Reject permutations that cross the boundary. */ >>> if (TARGET_RDNA2_PLUS) >>> for (unsigned int i = 0; i < nelt; i++) >>> if (i < 31 ? perm[i] > 31 : perm[i] < 32) >>> return false; >>> >>> It looks right to me though? >> >> nelt == 32 so I think the last element has the wrong check applied? >> >> It should be >> >>> if (i < 32 ? perm[i] > 31 : perm[i] < 32) >> >> I think. With that the vectorization happens in a similar way but the >> failure still doesn't reproduce (without the patch, of course). Oops, I think you're right. > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting > two vectors src0 and src1 into one 32 element dst vector (it's no longer > required that src0 and src1 line up with the dst vector size btw, they > might have different nelt). So the loop would reject interleaving > the low parts of two 32 element vectors, a permute that would look like > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" > mean you can never mix the two vector inputs? Or does GCN not have > a two-to-one vector permute instruction? GCN does not have two-to-one vector permute in hardware, so we do two permutes and a vec_merge to get the same effect. GFX9 can permute all the elements within a 64 lane vector arbitrarily. GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but no value may cross the boundary. AFAIK there's no way to do that via any vector instruction (i.e. without writing to memory, or extracting values element-wise). In theory, we could implement permutes with different sized inputs and outputs, but right now those are rejected early. The interleave example wouldn't work in hardware, for GFX10, but we could have it for GFX9. However, I think you might be right about the numbering of the "perm" array; we probably need to be testing "(perm[i] % nelt) > 31" if we are to support two-to-one permutations. Thanks for looking at this. Andrew
On Wed, 26 Jun 2024, ams at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #14 from Andrew Stubbs <ams at gcc dot gnu.org> --- > On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > > > --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> --- > > (In reply to Richard Biener from comment #12) > >> (In reply to Andrew Stubbs from comment #10) > >>> GFX10 has more limited permutation capabilities than GFX9 because it > >>> only has 32-lane vectors natively, even though we're using the 64-lane > >>> "compatibility" mode. > >>> > >>> However, in theory, the permutation capabilities on V32 and below should > >>> be the same, and some permutations on V64 are allowed, so I don't know > >>> why it doesn't use it. It's possible I broke the logic in > >>> gcn_vectorize_vec_perm_const: > >>> > >>> /* RDNA devices can only do permutations within each group of 32-lanes. > >>> Reject permutations that cross the boundary. */ > >>> if (TARGET_RDNA2_PLUS) > >>> for (unsigned int i = 0; i < nelt; i++) > >>> if (i < 31 ? perm[i] > 31 : perm[i] < 32) > >>> return false; > >>> > >>> It looks right to me though? > >> > >> nelt == 32 so I think the last element has the wrong check applied? > >> > >> It should be > >> > >>> if (i < 32 ? perm[i] > 31 : perm[i] < 32) > >> > >> I think. With that the vectorization happens in a similar way but the > >> failure still doesn't reproduce (without the patch, of course). > > Oops, I think you're right. > > > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting > > two vectors src0 and src1 into one 32 element dst vector (it's no longer > > required that src0 and src1 line up with the dst vector size btw, they > > might have different nelt). So the loop would reject interleaving > > the low parts of two 32 element vectors, a permute that would look like > > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" > > mean you can never mix the two vector inputs? Or does GCN not have > > a two-to-one vector permute instruction? > > GCN does not have two-to-one vector permute in hardware, so we do two > permutes and a vec_merge to get the same effect. > > GFX9 can permute all the elements within a 64 lane vector arbitrarily. > > GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but > no value may cross the boundary. AFAIK there's no way to do that via any > vector instruction (i.e. without writing to memory, or extracting values > element-wise). I see - so it cannot even swap low-32 and high-32? I'm thinking of what sub-part of permutes would be possible by extending the two-to-one vec_merge trick. OTOH we restrict GFX10/11 to 32 lane vectors so in practice this restriction should be fine.
On 26/06/2024 14:41, rguenther at suse dot de wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> --- >>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting >>> two vectors src0 and src1 into one 32 element dst vector (it's no longer >>> required that src0 and src1 line up with the dst vector size btw, they >>> might have different nelt). So the loop would reject interleaving >>> the low parts of two 32 element vectors, a permute that would look like >>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" >>> mean you can never mix the two vector inputs? Or does GCN not have >>> a two-to-one vector permute instruction? >> >> GCN does not have two-to-one vector permute in hardware, so we do two >> permutes and a vec_merge to get the same effect. >> >> GFX9 can permute all the elements within a 64 lane vector arbitrarily. >> >> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but >> no value may cross the boundary. AFAIK there's no way to do that via any >> vector instruction (i.e. without writing to memory, or extracting values >> element-wise). > > I see - so it cannot even swap low-32 and high-32? I'm thinking of > what sub-part of permutes would be possible by extending the two-to-one > vec_merge trick. No(?) The 64-lane compatibility mode works, under the hood, by allocating double the number of 32-lane registers and then executing each instruction twice. Mostly this is invisible, but it gets exposed for permutations and the like. Logically, the microarchitecture could do a vec_merge to DTRT, but I've not found a way to express that. It's possible I missed something when RTFM. > OTOH we restrict GFX10/11 to 32 lane vectors so in practice this > restriction should be fine. Yes, with the "31" fixed it should work. Andrew
The master branch has been updated by Andrew Stubbs <ams@gcc.gnu.org>: https://gcc.gnu.org/g:ef0b30212f7756db15d7507bfd871bf377d7d648 commit r15-1705-gef0b30212f7756db15d7507bfd871bf377d7d648 Author: Andrew Stubbs <ams@baylibre.com> Date: Fri Jun 28 10:47:50 2024 +0000 amdgcn: Fix RDNA V32 permutations [PR115640] There was an off-by-one error in the RDNA validation check, plus I forgot to allow for two-to-one permute-and-merge operations. PR target/115640 gcc/ChangeLog: * config/gcn/gcn.cc (gcn_vectorize_vec_perm_const): Modify RDNA checks.
That should fix the broken validation check. All V32 permutations should work now on RDNA GPUs, I think. V16 and smaller were already working fine.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:0192341a07b8ea30f631cf4afdc6fcf3fa7ce838 commit r15-1706-g0192341a07b8ea30f631cf4afdc6fcf3fa7ce838 Author: Richard Biener <rguenther@suse.de> Date: Wed Jun 26 14:07:51 2024 +0200 tree-optimization/115640 - outer loop vect with inner SLP permute The following fixes wrong-code when using outer loop vectorization and an inner loop SLP access with permutation. A wrong adjustment to the IV increment is then applied on GCN. PR tree-optimization/115640 * tree-vect-stmts.cc (vectorizable_load): With an inner loop SLP access to not apply a gap adjustment.
The wrong-code issue should be fixed now.