Created attachment 42299 [details]
vectslp.cpp

The attached example is a simple matrix multiplication. With -O3 or
-O2 -ftree-slp-vectorize the basic block is not vectorized. Oddly, with
-Os -ftree-slp-vectorize it is.
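(The attachment itself is not reproduced in this report; the following is only a sketch of the kind of code involved, reconstructed from the field names m11, dx and the 24-byte struct accesses visible in the GIMPLE dumps further down. The exact attached source may differ.)

// Hypothetical reconstruction of vectslp.cpp: a 2x2 matrix plus a
// translation (six floats), multiplied as a 2D affine transform.  Only the
// field names m11/dx and the overall layout are confirmed by the dumps
// below; everything else is assumed.
struct matrix
{
  float m11, m12, m21, m22, dx, dy;
};

matrix operator* (const matrix &a, const matrix &b)
{
  matrix out;
  out.m11 = a.m11 * b.m11 + a.m12 * b.m21;
  out.m12 = a.m11 * b.m12 + a.m12 * b.m22;
  out.m21 = a.m21 * b.m11 + a.m22 * b.m21;
  out.m22 = a.m21 * b.m12 + a.m22 * b.m22;
  out.dx  = a.dx  * b.m11 + a.dy  * b.m21 + b.dx;
  out.dy  = a.dx  * b.m12 + a.dy  * b.m22 + b.dy;
  return out;
}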
Created attachment 42300 [details] Assembler output with -O3
Created attachment 42301 [details] Assembler output with -Os -ftree-slp-vectorize
Note that the fact it can be vectorized at all at -Os appears to be new in GCC 7.
Hmm, on aarch64 we do a decent job at vectorizing this (since GCC 11):

        ldp     d4, d0, [x1]
        ldr     d7, [x0, 16]
        ldp     d6, d5, [x0]
        fmul    v3.2s, v0.2s, v7.s[1]
        ldr     d1, [x1, 16]
        fmul    v2.2s, v0.2s, v6.s[1]
        fmul    v0.2s, v0.2s, v5.s[1]
        fmla    v3.2s, v4.2s, v7.s[0]
        fmla    v2.2s, v4.2s, v6.s[0]
        fmla    v0.2s, v4.2s, v5.s[0]
        fadd    v1.2s, v1.2s, v3.2s
        stp     d2, d0, [x8]
        str     d1, [x8, 16]

I suspect this is because V2SF does not exist on x86_64.

Using -Dfloat=double seems to do better on x86_64 (with -mavx2):

        vmovupd (%rdx), %ymm0
        vpermilpd       $0, (%rsi), %ymm1
        movq    %rdi, %rax
        vmovsd  32(%rsi), %xmm5
        vmovsd  40(%rsi), %xmm4
        vpermpd $68, %ymm0, %ymm2
        vpermpd $238, %ymm0, %ymm3
        vmulpd  %ymm2, %ymm1, %ymm2
        vpermilpd       $15, (%rsi), %ymm1
        vmulpd  %ymm3, %ymm1, %ymm1
        vaddpd  %ymm1, %ymm2, %ymm1
        vmulsd  %xmm5, %xmm0, %xmm2
        vmovupd %ymm1, (%rdi)
        vmovapd %xmm0, %xmm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmulsd  %xmm4, %xmm0, %xmm3
        vunpckhpd       %xmm1, %xmm1, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vmulsd  %xmm5, %xmm1, %xmm1
        vmulsd  %xmm4, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm2, %xmm2
        vaddsd  32(%rdx), %xmm2, %xmm2
        vaddsd  %xmm0, %xmm1, %xmm1
        vaddsd  40(%rdx), %xmm1, %xmm1
        vmovsd  %xmm2, 32(%rdi)
        vmovsd  %xmm1, 40(%rdi)
x86 actually does have V2SF; the issue is that there's an opportunity for V4SF vectorization and one for V2SF arriving at the same load groups, and that causes a conflict (there are other PRs about this general issue), so we kill one part:

t.C:18:12: missed:   desired vector type conflicts with earlier one for _2 = b_35(D)->m11;
t.C:18:12: note:  removing SLP instance operations starting from: <retval>.dx = _27;

We also have a bunch of live lanes off the remaining vectorized piece, which makes the code a bit awkward.

Unfortunately we have no way to force 64-bit vectors here (V2SF) to see whether splitting up the V4SFmode partition would help (I guess it would, as can be seen from using 'double').
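For illustration only, and assuming the hypothetical 'matrix' layout sketched above: the dx/dy tail written by hand with GCC's generic vector extension shows roughly the 64-bit (V2SF-shaped) computation being discussed. This is not something the vectorizer can currently be told to do; it is just a sketch of what a forced V2SF split would compute.

// Sketch only (hypothetical struct layout): the dx/dy tail expressed with
// two-float vectors via GCC's vector_size extension.
typedef float v2sf __attribute__ ((vector_size (8)));

static inline v2sf
load2 (const float *p)
{
  v2sf v;
  __builtin_memcpy (&v, p, sizeof v);   // unaligned 8-byte load
  return v;
}

// out.{dx,dy} = a.dx * b.{m11,m12} + a.dy * b.{m21,m22} + b.{dx,dy}
static inline v2sf
tail (const matrix &a, const matrix &b)
{
  v2sf brow1 = load2 (&b.m11);          // { b.m11, b.m12 }
  v2sf brow2 = load2 (&b.m21);          // { b.m21, b.m22 }
  v2sf bt    = load2 (&b.dx);           // { b.dx,  b.dy  }
  v2sf adx   = { a.dx, a.dx };          // splat a.dx
  v2sf ady   = { a.dy, a.dy };          // splat a.dy
  return brow1 * adx + brow2 * ady + bt;
}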
I have a patch that produces

  vect__1.5_42 = MEM <const vector(4) float> [(float *)a_34(D)];
  vect__1.7_47 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 0, 0, 2, 2 }>;
  vect__2.10_49 = MEM <const vector(4) float> [(float *)b_35(D)];
  vect__2.12_53 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 0, 1, 0, 1 }>;
  vect__3.13_54 = vect__1.7_47 * vect__2.12_53;
  vect__2.30_73 = MEM <const vector(2) float> [(float *)b_35(D)];
  vect__1.18_61 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 1, 1, 3, 3 }>;
  vect__2.23_68 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 2, 3, 2, 3 }>;
  vect__6.24_69 = vect__1.18_61 * vect__2.23_68;
  vect__7.25_70 = vect__3.13_54 + vect__6.24_69;
  vect__5.40_85 = MEM <const vector(2) float> [(float *)b_35(D) + 8B];
  MEM <vector(4) float> [(float *)&<retval>] = vect__7.25_70;
  vect__21.35_81 = MEM <const vector(2) float> [(float *)a_34(D) + 16B];
  vect__1.36_82 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 0, 0 }>;
  vect__22.37_83 = vect__2.30_73 * vect__1.36_82;
  vect__1.46_94 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 1, 1 }>;
  vect__24.47_95 = vect__5.40_85 * vect__1.46_94;
  vect__25.48_96 = vect__22.37_83 + vect__24.47_95;
  vect__26.51_98 = MEM <const vector(2) float> [(float *)b_35(D) + 16B];
  vect__27.52_100 = vect__25.48_96 + vect__26.51_98;
  MEM <vector(2) float> [(float *)&<retval> + 16B] = vect__27.52_100;

That means it ends up with some odd vector loads, but with SSE 4.2 it becomes

        movups  (%rsi), %xmm5
        movups  (%rdx), %xmm1
        movq    %rdi, %rax
        movq    (%rdx), %xmm4
        movq    8(%rdx), %xmm3
        movsldup        %xmm5, %xmm0
        movaps  %xmm1, %xmm2
        movlhps %xmm1, %xmm2
        shufps  $238, %xmm1, %xmm1
        mulps   %xmm0, %xmm2
        movshdup        %xmm5, %xmm0
        mulps   %xmm1, %xmm0
        movq    16(%rsi), %xmm1
        addps   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        movsldup        %xmm1, %xmm0
        movshdup        %xmm1, %xmm1
        mulps   %xmm4, %xmm0
        mulps   %xmm3, %xmm1
        addps   %xmm1, %xmm0
        movq    16(%rdx), %xmm1
        addps   %xmm1, %xmm0
        movlps  %xmm0, 16(%rdi)

Alternatively, -mavx can do some of the required perms with the loads, and with -mfma we can use an FMA as well:

        vpermilps       $238, (%rdx), %xmm1
        vpermilps       $245, (%rsi), %xmm0
        movq    %rdi, %rax
        vpermilps       $160, (%rsi), %xmm3
        vpermilps       $68, (%rdx), %xmm4
        vmulps  %xmm1, %xmm0, %xmm0
        vmovq   (%rdx), %xmm2
        vfmadd231ps     %xmm4, %xmm3, %xmm0
        vmovq   8(%rdx), %xmm3
        vmovups %xmm0, (%rdi)
        vmovq   16(%rsi), %xmm0
        vmovsldup       %xmm0, %xmm1
        vmovshdup       %xmm0, %xmm0
        vmulps  %xmm3, %xmm0, %xmm0
        vfmadd132ps     %xmm1, %xmm0, %xmm2
        vmovq   16(%rdx), %xmm0
        vaddps  %xmm2, %xmm0, %xmm0
        vmovlps %xmm0, 16(%rdi)

I'm not sure whether the vmovups + vmovs{l,h}dup are any better than doing two scalar loads + dups though; it might at least avoid some STLF conflict with the earlier smaller stores.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:6390c5047adb75960f86d56582e6322aaa4d9281

commit r12-3893-g6390c5047adb75960f86d56582e6322aaa4d9281
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Nov 18 09:36:57 2020 +0100

    Allow different vector types for stmt groups

    This allows vectorization (in practice non-loop vectorization) to
    have a stmt participate in different vector type vectorizations.
    It allows us to remove vect_update_shared_vectype and replace it
    by pushing/popping STMT_VINFO_VECTYPE from SLP_TREE_VECTYPE around
    vect_analyze_stmt and vect_transform_stmt.

    For data-refs the situation is a bit more complicated since we
    analyze alignment info with a specific vector type in mind which
    doesn't play well when that changes.  So the bulk of the change
    is passing down the actual vector type used for a vectorized access
    to the various accessors of alignment info, first and foremost
    dr_misalignment but also aligned_access_p, known_alignment_for_access_p,
    vect_known_alignment_in_bytes and vect_supportable_dr_alignment.
    I took the liberty to replace ALL_CAPS macro accessors with the
    lower-case function invocations.

    The actual changes to the behavior are in dr_misalignment, which now
    is the place factoring in the negative step adjustment as well as
    handling alignment queries for a vector type with bigger alignment
    requirements than what we can (or have) analyzed.

    vect_slp_analyze_node_alignment makes use of this and, upon receiving
    a vector type with a bigger alignment desire, re-analyzes the DR
    with respect to it but keeps an older, more precise result if possible.
    In this context it might be possible to do the analysis just once,
    but instead of analyzing with respect to a specific desired alignment
    look for the biggest alignment for which we can compute a not unknown
    alignment.

    The ChangeLog includes the functional changes but not the bulk due
    to the alignment accessor API changes - I hope that's something good.

    2021-09-17  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/97351
            PR tree-optimization/97352
            PR tree-optimization/82426
            * tree-vectorizer.h (dr_misalignment): Add vector type
            argument.
            (aligned_access_p): Likewise.
            (known_alignment_for_access_p): Likewise.
            (vect_supportable_dr_alignment): Likewise.
            (vect_known_alignment_in_bytes): Likewise.  Refactor.
            (DR_MISALIGNMENT): Remove.
            (vect_update_shared_vectype): Likewise.
            * tree-vect-data-refs.c (dr_misalignment): Refactor, handle
            a vector type with larger alignment requirement and apply
            the negative step adjustment here.
            (vect_calculate_target_alignment): Remove.
            (vect_compute_data_ref_alignment): Get explicit vector type
            argument, do not apply a negative step alignment adjustment
            here.
            (vect_slp_analyze_node_alignment): Re-analyze alignment
            when we re-visit the DR with a bigger desired alignment but
            keep more precise results from smaller alignments.
            * tree-vect-slp.c (vect_update_shared_vectype): Remove.
            (vect_slp_analyze_node_operations_1): Do not update the
            shared vector type on stmts.
            * tree-vect-stmts.c (vect_analyze_stmt): Push/pop the
            vector type of an SLP node to the representative stmt-info.
            (vect_transform_stmt): Likewise.
            * gcc.target/i386/vect-pr82426.c: New testcase.
            * gcc.target/i386/vect-pr97352.c: Likewise.
Fixed for GCC 12.