After 2019-11-14 Richard Sandiford <richard.sandiford@arm.com> * tree-cfg.c (verify_gimple_assign_unary): Handle conversions between vector types. * tree-vect-stmts.c (vectorizable_conversion): Extend the non-widening and non-narrowing path to handle standard conversion codes, if the target supports them. * expr.c (convert_move): Try using the extend and truncate optabs for vectors. * optabs-tree.c (supportable_convert_operation): Likewise. * config/aarch64/iterators.md (Vnarroqw): New iterator. * config/aarch64/aarch64-simd.md (<optab><Vnarrowq><mode>2) (trunc<mode><Vnarrowq>2): New patterns. it would now be possible to BB vectorize the following but the x86 backend lacks appropriate patterns for the vector mode variants (here V8HI -> V8QI narrowing). typedef unsigned char v16qi __attribute__((vector_size(16))); typedef unsigned short v8hi __attribute__((vector_size(16))); void bar (v8hi *dst, v16qi * __restrict src) { unsigned short tem[8]; tem[0] = (*src)[0]; tem[1] = (*src)[1]; tem[2] = (*src)[2]; tem[3] = (*src)[3]; tem[4] = (*src)[4]; tem[5] = (*src)[5]; tem[6] = (*src)[6]; tem[7] = (*src)[7]; dst[0] = *(v8hi *)tem; } void foo (v8hi *dst, v16qi src) { unsigned short tem[8]; tem[0] = src[0]; tem[1] = src[1]; tem[2] = src[2]; tem[3] = src[3]; tem[4] = src[4]; tem[5] = src[5]; tem[6] = src[6]; tem[7] = src[7]; dst[0] = *(v8hi *)tem; }
It looks like there are already some avx512 patterns matching this but they are not visible to the RTL expanders. (define_insn "zero_extendv8qiv8hi2" [(set (match_operand:V8HI 0 "register_operand" "=x,v") (zero_extend:V8HI (match_operand:V8QI 1 "register_operand" "x,v")))] "TARGET_SSE2" "@ punpcklbw\t{%1, %0|%0, %1} vpunpcklbw\t{%1, %0|%0, %1}" ) does it for the testcase pasted but truncation and other extends are also missing, likewise for floats.
Created attachment 47924 [details] Prototype patch Prototype patch that introduces missing expanders to generate PMOVZX and PMOVSX packed moves. For the testcase in Comment #0, we generate (-O3 -mavx): bar: vpmovzxbw (%rsi), %xmm0 vmovdqa %xmm0, (%rdi) ret foo: vpmovzxbw %xmm0, %xmm0 vmovdqa %xmm0, (%rdi) ret
Richi, should the following test also vectorize? --cut here-- typedef unsigned char v16qi __attribute__((vector_size (16))); typedef unsigned int v4si __attribute__((vector_size (16))); void foo_u8_u32 (v4si * dst, v16qi * __restrict src) { unsigned int tem[4]; tem[0] = (*src)[0]; tem[1] = (*src)[1]; tem[2] = (*src)[2]; tem[3] = (*src)[3]; dst[0] = *(v4si *) tem; } void bar_u8_u32 (v4si * dst, v16qi src) { unsigned int tem[4]; tem[0] = src[0]; tem[1] = src[1]; tem[2] = src[2]; tem[3] = src[3]; dst[0] = *(v4si *) tem; } --cut here--
(In reply to Uroš Bizjak from comment #3) > Richi, should the following test also vectorize? It doesn't vectorize because supportable_convert_operation returns false for: (gdb) p debug_generic_expr (vectype_out) vector(4) unsigned int $1 = void (gdb) p debug_generic_expr (vectype_in) vector(4) unsigned char $2 = void and TYPE_MODE for "vector(4) unsigned char" returns E_SImode: (gdb) p m1 $3 = E_V4SImode (gdb) p m2 $4 = E_SImode and: if (!VECTOR_MODE_P (m1) || !VECTOR_MODE_P (m2)) return false;
Created attachment 47927 [details] Prototype patch v2 A couple of typos fixed. Still doesn't vectorize v4qi->v4si, v2qi->v2di, v2hi->v2di and v4qi->v4di. Looks like tree-opt issue, as explained in Comment #4.
Created attachment 47928 [details] Test cases sse4, avx2 and avx512bw test cases. Fails: FAIL: gcc.target/i386/pr92658-avx2.c scan-assembler-times pmovzxbq 2 FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbd 2 FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbq 2 FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxwq 2
CC author of the generic functionality part.
On Fri, 28 Feb 2020, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658 > > --- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> --- > Richi, should the following test also vectorize? In priciple yes. I see both cases "vectorized" to a store from a CTOR but then my primitive pattern matching in forwprop not applying because supportable_convert_operation is confused about a vector(4) char type having SImode and bailing out at the 292 if (!VECTOR_MODE_P (m1) || !VECTOR_MODE_P (m2)) 293 return false; check. So it looks like with just SSE2 the backend doesn't consider V4QImode a supported vector type. I'm not sure we want to fix that but then this means regular optab checks won't do it here. Interesting cases nevertheless. Would those conversions map to good assembly? > --cut here-- > typedef unsigned char v16qi __attribute__((vector_size (16))); > typedef unsigned int v4si __attribute__((vector_size (16))); > > void > foo_u8_u32 (v4si * dst, v16qi * __restrict src) > { > unsigned int tem[4]; > tem[0] = (*src)[0]; > tem[1] = (*src)[1]; > tem[2] = (*src)[2]; > tem[3] = (*src)[3]; > dst[0] = *(v4si *) tem; > } > > void > bar_u8_u32 (v4si * dst, v16qi src) > { > unsigned int tem[4]; > tem[0] = src[0]; > tem[1] = src[1]; > tem[2] = src[2]; > tem[3] = src[3]; > dst[0] = *(v4si *) tem; > } > --cut here--
(In reply to rguenther@suse.de from comment #8) > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658 > > > > --- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> --- > > Richi, should the following test also vectorize? > > In priciple yes. I see both cases "vectorized" to a store > from a CTOR but then my primitive pattern matching in forwprop > not applying because supportable_convert_operation is confused > about a vector(4) char type having SImode and bailing out at the > > 292 if (!VECTOR_MODE_P (m1) || !VECTOR_MODE_P (m2)) > 293 return false; > > check. So it looks like with just SSE2 the backend doesn't > consider V4QImode a supported vector type. I'm not sure we > want to fix that but then this means regular optab checks > won't do it here. > > Interesting cases nevertheless. > > Would those conversions map to good assembly? Yes, please see attached test cases, especially pr92658-sse4.c. We have pmovzxbd for v4qi->v4si, pmovzxbq for v2qi->v2di, and pmovzxwd for v2hi->v2di, in addition to AVX2 pmovzxbq for v4qi->v4di. All testcases should vectorize with a single instruction. It is also possible to introduce additional conversions to 64bit vectors (e.g. v2qi->v2si) for TARGET_MMX_WITH_SSE targets. Please note that there are similar sign_extend patterns and corresponding truncate patterns with AVX512F, currently not covered by attached patch and testcases.
The patch is ready to be pushed, it is waiting for a decision what to do with failed cases. Richi, should this patch move forward (eventually XFAILing failed cases), or do you plan to look at the fails from the generic vectorizer POV?
On Thu, 14 May 2020, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658 > > --- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> --- > The patch is ready to be pushed, it is waiting for a decision what to do with > failed cases. > > Richi, should this patch move forward (eventually XFAILing failed cases), or do > you plan to look at the fails from the generic vectorizer POV? I think we should go forward with the patch, either XFAILing the testcases or splitting out the testcase (and backend patterns that do not get exercised due to the issue). I've already said in comment#8 that the issue here is optabs working with modes and not vector types, so it's a bit hard to use the same mechanism to deal with the currently failing cases. One possible route would be to add V4QImode similar to how we now do V2SFmode with SSE but of course where do we stop ... As said we can try to tackle this incrementally. Maybe Richard also has input here?
(In reply to rguenther@suse.de from comment #11) > On Thu, 14 May 2020, ubizjak at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658 > > > > --- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> --- > > The patch is ready to be pushed, it is waiting for a decision what to do with > > failed cases. > > > > Richi, should this patch move forward (eventually XFAILing failed cases), or do > > you plan to look at the fails from the generic vectorizer POV? > > I think we should go forward with the patch, either XFAILing the testcases > or splitting out the testcase (and backend patterns that do not get > exercised due to the issue). > > I've already said in comment#8 that the issue here is optabs working > with modes and not vector types, so it's a bit hard to use the > same mechanism to deal with the currently failing cases. One > possible route would be to add V4QImode similar to how we now > do V2SFmode with SSE but of course where do we stop ... > > As said we can try to tackle this incrementally. Maybe Richard also > has input here? Nothing to add really, but: yeah, the idea was that the target would provide smaller vector modes if it can efficiently load, store and (at least) add them. I think it would be good to do that for aarch64 Advanced SIMD at some point. There are three points behind using vector modes for this: 1. It gives the target the flexibility to decide how best to implement the operation (All at one end, and if so, which end? Or spread out evenly across the vector?) On aarch64, the choice would differ between SVE and Advanced SIMD. 2. It means that __attribute__((vector_size)) vectors benefit. 3. The smaller modes can be used in conjunction with autovectorize_vector_modes to give the loop vectoriser a choice between operating on packed or unpacked vectors. This is mostly useful for VECT_COMPARE_COSTS but can help with low-trip-count loops even without. So yeah, defining V4QI sounds like the way to go if the architecture can process it efficiently. I get the "where do we stop" concern, but .md mode iterators should help.
*** Bug 92611 has been marked as a duplicate of this bug. ***
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>: https://gcc.gnu.org/g:f6e40195ec3d3b402a5f6c58dbf359479bc4cbfa commit r11-485-gf6e40195ec3d3b402a5f6c58dbf359479bc4cbfa Author: Uros Bizjak <ubizjak@gmail.com> Date: Tue May 19 11:25:46 2020 +0200 i386: Add missing vector zero/sign extend expanders [PR92658] 2020-05-19 Uroš Bizjak <ubizjak@gmail.com> gcc/ChangeLog: PR target/92658 * config/i386/sse.md (<code>v16qiv16hi2): New expander. (<code>v32qiv32hi2): Ditto. (<code>v8qiv8hi2): Ditto. (<code>v16qiv16si2): Ditto. (<code>v8qiv8si2): Ditto. (<code>v4qiv4si2): Ditto. (<code>v16hiv16si2): Ditto. (<code>v8hiv8si2): Ditto. (<code>v4hiv4si2): Ditto. (<code>v8qiv8di2): Ditto. (<code>v4qiv4di2): Ditto. (<code>v2qiv2di2): Ditto. (<code>v8hiv8di2): Ditto. (<code>v4hiv4di2): Ditto. (<code>v2hiv2di2): Ditto. (<code>v8siv8di2): Ditto. (<code>v4siv4di2): Ditto. (<code>v2siv2di2): Ditto. gcc/testsuite/ChangeLog: PR target/92658 * gcc.target/i386/pr92658-sse4.c: New test. * gcc.target/i386/pr92658-avx2.c: New test. * gcc.target/i386/pr92658-avx512bw.c: New test.
I will leave truncations (Down Converts in Intel speak) which are AVX512F instructions to someone else. It should be easy to add missing patterns and tests following the example of committed patch.
(In reply to Uroš Bizjak from comment #15) > I will leave truncations (Down Converts in Intel speak) which are AVX512F > instructions to someone else. It should be easy to add missing patterns and > tests following the example of committed patch. I'll take a look.
Created attachment 48570 [details] 0001-Add-missing-vector-truncmn2-expanders-PR92658.patch Seems there're only truncmn2 for truncate, not expander for us_truncate and ss_truncate, am i missing?
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:e740f3d73144abbca1ad98a04825c6bd63314a0b commit r11-571-ge740f3d73144abbca1ad98a04825c6bd63314a0b Author: liuhongt <hongtao.liu@intel.com> Date: Wed May 20 15:53:14 2020 +0800 Add missing vector truncmn2 expanders [PR92658] 2020-05-22 Hongtao.liu <hongtao.liu@intel.com> gcc/ChangeLog: PR target/92658 * config/i386/sse.md (trunc<pmov_src_lower><mode>2): New expander (truncv32hiv32qi2): Ditto. (trunc<ssedoublemodelower><mode>2): Ditto. (trunc<mode><pmov_dst_3>2): Ditto. (trunc<mode><pmov_dst_mode_4>2): Ditto. (truncv2div2si2): Ditto. (truncv8div8qi2): Ditto. (avx512f_<code>v8div16qi2): Renaming from *avx512f_<code>v8div16qi2. (avx512vl_<code>v2div2si): Renaming from *avx512vl_<code>v2div2si2. (avx512vl_<code><mode>v2<ssecakarnum>qi2): Renaming from *avx512vl_<code><mode>v<ssescalarnum>qi2. gcc/testsuite/ChangeLog: * gcc.target/i386/pr92658-avx512f.c: New test. * gcc.target/i386/pr92658-avx512vl.c: Ditto. * gcc.target/i386/pr92658-avx512bw-trunc.c: Ditto.
(In reply to CVS Commits from comment #18) > gcc/testsuite/ChangeLog: > * gcc.target/i386/pr92658-avx512f.c: New test. > * gcc.target/i386/pr92658-avx512vl.c: Ditto. > * gcc.target/i386/pr92658-avx512bw-trunc.c: Ditto. Note that the second one as committed has an extra closing brace which causes an error: ERROR: gcc.target/i386/pr92658-avx512vl.c: unknown dg option: \} for "}" diff --git a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c index 50b32f968ac3..dc50084119b5 100644 --- a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c +++ b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c @@ -121,7 +121,7 @@ truncdb_128 (v16qi * dst, v4si * __restrict src) dst[0] = *(v16qi *) tem; } -/* { dg-final { scan-assembler-times "vpmovqd" 2 } } } */ +/* { dg-final { scan-assembler-times "vpmovqd" 2 } } */ /* { dg-final { scan-assembler-times "vpmovqw" 2 { xfail *-*-* } } } */ /* { dg-final { scan-assembler-times "vpmovqb" 2 { xfail *-*-* } } } */ /* { dg-final { scan-assembler-times "vpmovdw" 1 } } */
(In reply to Mark Wielaard from comment #19) > (In reply to CVS Commits from comment #18) > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr92658-avx512f.c: New test. > > * gcc.target/i386/pr92658-avx512vl.c: Ditto. > > * gcc.target/i386/pr92658-avx512bw-trunc.c: Ditto. > > Note that the second one as committed has an extra closing brace which > causes an error: > > ERROR: gcc.target/i386/pr92658-avx512vl.c: unknown dg option: \} for "}" > > diff --git a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c > b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c > index 50b32f968ac3..dc50084119b5 100644 > --- a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c > +++ b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c > @@ -121,7 +121,7 @@ truncdb_128 (v16qi * dst, v4si * __restrict src) > dst[0] = *(v16qi *) tem; > } > > -/* { dg-final { scan-assembler-times "vpmovqd" 2 } } } */ > +/* { dg-final { scan-assembler-times "vpmovqd" 2 } } */ > /* { dg-final { scan-assembler-times "vpmovqw" 2 { xfail *-*-* } } } */ > /* { dg-final { scan-assembler-times "vpmovqb" 2 { xfail *-*-* } } } */ > /* { dg-final { scan-assembler-times "vpmovdw" 1 } } */ Oh, sorry for typo.
(In reply to CVS Commits from comment #14) > The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>: > > https://gcc.gnu.org/g:f6e40195ec3d3b402a5f6c58dbf359479bc4cbfa > > commit r11-485-gf6e40195ec3d3b402a5f6c58dbf359479bc4cbfa > Author: Uros Bizjak <ubizjak@gmail.com> > Date: Tue May 19 11:25:46 2020 +0200 > > i386: Add missing vector zero/sign extend expanders [PR92658] > > 2020-05-19 Uroš Bizjak <ubizjak@gmail.com> > > gcc/ChangeLog: > PR target/92658 > * config/i386/sse.md (<code>v16qiv16hi2): New expander. > (<code>v32qiv32hi2): Ditto. > (<code>v8qiv8hi2): Ditto. > (<code>v16qiv16si2): Ditto. > (<code>v8qiv8si2): Ditto. > (<code>v4qiv4si2): Ditto. > (<code>v16hiv16si2): Ditto. > (<code>v8hiv8si2): Ditto. > (<code>v4hiv4si2): Ditto. > (<code>v8qiv8di2): Ditto. > (<code>v4qiv4di2): Ditto. > (<code>v2qiv2di2): Ditto. > (<code>v8hiv8di2): Ditto. > (<code>v4hiv4di2): Ditto. > (<code>v2hiv2di2): Ditto. > (<code>v8siv8di2): Ditto. > (<code>v4siv4di2): Ditto. > (<code>v2siv2di2): Ditto. > <code> for sign_extend is sign_extend, but the standard pattern names for sign_extend is extendmn. We need to to refine those expanders Add define_code_attr like aarch64/iterators.md? ---------- ;; Map rtl objects to optab names (define_code_attr optab [(ashift "ashl") (ashiftrt "ashr") (lshiftrt "lshr") (rotatert "rotr") (sign_extend "extend") (zero_extend "zero_extend") ---------- > gcc/testsuite/ChangeLog: > PR target/92658 > * gcc.target/i386/pr92658-sse4.c: New test. > * gcc.target/i386/pr92658-avx2.c: New test. > * gcc.target/i386/pr92658-avx512bw.c: New test.
(In reply to Hongtao.liu from comment #21) > Add define_code_attr like aarch64/iterators.md? > > ---------- > ;; Map rtl objects to optab names > (define_code_attr optab [(ashift "ashl") > (ashiftrt "ashr") > (lshiftrt "lshr") > (rotatert "rotr") > (sign_extend "extend") > (zero_extend "zero_extend") Yes. Please go ahead, the patch is preapproved.
(In reply to Uroš Bizjak from comment #22) > (In reply to Hongtao.liu from comment #21) > > Add define_code_attr like aarch64/iterators.md? > > > > ---------- > > ;; Map rtl objects to optab names > > (define_code_attr optab [(ashift "ashl") > > (ashiftrt "ashr") > > (lshiftrt "lshr") > > (rotatert "rotr") > > (sign_extend "extend") > > (zero_extend "zero_extend") > > Yes. Please go ahead, the patch is preapproved. Fixed by r11-6351
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:1b5669752426d225b0088d57d1d2fffba9625032 commit r11-6514-g1b5669752426d225b0088d57d1d2fffba9625032 Author: Hongyu Wang <hongyu.wang@intel.com> Date: Tue Dec 29 15:14:09 2020 +0800 Adjust testcase for PR 92658 gcc/testsuite/ChangeLog: * gcc.target/i386/pr92658-avx512bw.c: Add -mprefer-vector-width=512 to avoid impact of different default mtune which gcc is built with. * gcc.target/i386/pr92658-avx512bw-2.c: Ditto.
(In reply to Uroš Bizjak from comment #5) > Created attachment 47927 [details] > Prototype patch v2 > > A couple of typos fixed. > > Still doesn't vectorize v4qi->v4si, v2qi->v2di, v2hi->v2di and v4qi->v4di. > > Looks like tree-opt issue, as explained in Comment #4. v4qi->v4si, v2hi->v2di and v4qi->v4di are vectorized, but v2qi->v2di is not, guess it's related to 16-bit vector support.
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>: https://gcc.gnu.org/g:608e7f3ab47fe746279c552c3574147aa3d8ee76 commit r14-666-g608e7f3ab47fe746279c552c3574147aa3d8ee76 Author: Uros Bizjak <ubizjak@gmail.com> Date: Wed May 10 22:40:53 2023 +0200 i386: Add missing vector extend patterns [PR92658] Add missing insn pattern for v2qi -> v2si vector extend and named expanders to activate generation of vector extends to 8-byte and 4-byte vectors. gcc/ChangeLog: PR target/92658 * config/i386/mmx.md (sse4_1_<code>v2qiv2si2): New insn pattern. (<insn>v4qiv4hi2): New expander. (<insn>v2hiv2si2): Ditto. (<insn>v2qiv2si2): Ditto. (<insn>v2qiv2hi2): Ditto. gcc/testsuite/ChangeLog: PR target/92658 * gcc.target/i386/pr92658-sse4-4b.c: New test. * gcc.target/i386/pr92658-sse4-8b.c: New test.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:49337040865269e13cdc2ead276d12ecb2e9f606 commit r14-1509-g49337040865269e13cdc2ead276d12ecb2e9f606 Author: liuhongt <hongtao.liu@intel.com> Date: Thu Jun 1 15:08:02 2023 +0800 i386: Add missing vector truncate patterns [PR92658]. Add missing insn patterns for v2si -> v2hi/v2qi and v2hi-> v2qi vector truncate. gcc/ChangeLog: PR target/92658 * config/i386/mmx.md (truncv2hiv2qi2): New define_insn. (truncv2si<mode>2): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr92658-avx512bw-trunc-2.c: New test.