Take:

```
void v4hi_smul(signed short *t, signed short tt)
{
  t[0] *= tt;
  t[1] *= tt;
  t[2] *= tt;
  t[3] *= tt;
}

void v4hi_umul(unsigned short *t, unsigned short tt)
{
  t[0] *= tt;
  t[1] *= tt;
  t[2] *= tt;
  t[3] *= tt;
}
```

At `-O2 -fno-vect-cost-model` on x86_64, code generation for v4hi_smul is really bad while v4hi_umul's is decent. Both functions should produce the same code, but currently they don't. I noticed this while adding V2HI support to the aarch64 backend and writing up testcases.
It's because we discover

```
t.c:3:8: note:   node 0x6624850 (max_nunits=4, refcnt=2) vector(4) unsigned short
t.c:3:8: note:   op template: _4 = _2 * tt.0_3;
t.c:3:8: note:     stmt 0 _4 = _2 * tt.0_3;
t.c:3:8: note:     stmt 1 _8 = tt.0_3 * _7;
t.c:3:8: note:     stmt 2 _12 = tt.0_3 * _11;
t.c:3:8: note:     stmt 3 _16 = tt.0_3 * _15;
t.c:3:8: note:     children 0x66248e0 0x6624a00
t.c:3:8: note:   node 0x66248e0 (max_nunits=4, refcnt=2) vector(4) unsigned short
```

and the comment already says:

```
      if (is_a <bb_vec_info> (vinfo)
          && !oprnd_info->any_pattern)
        {
          /* Now for commutative ops we should see whether we can make
             the other operand matching.  */
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "treating operand as external\n");
          oprnd_info->first_dt = dt = vect_external_def;
        }
      else
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "Build SLP failed: different types\n");
          return 1;
        }
```

Going the loop-SLP way in this case fixes it, but we should probably pass down a hint whether the parent operation could swap (though we don't know whether that would then succeed). This then produces:

```
v4hi_smul:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %xmm0
        movd    %esi, %xmm2
        pshuflw $0, %xmm2, %xmm1
        pmullw  %xmm1, %xmm0
        movq    %xmm0, (%rdi)
        ret
```

Note this is all heuristics and shows the difficulty of a greedy search (without exploring the full search space).
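For reference, the good sequence above corresponds to this portable sketch using GNU vector extensions (the typedef and function name are mine, not from the testcase):

```c
#include <string.h>

typedef short v4hi_t __attribute__((vector_size(8)));  /* GNU vector extension */

/* Model of the desired codegen: one 8-byte load, a broadcast of tt,
   one vector multiply, one 8-byte store.  */
void v4hi_smul_model(short *t, short tt)
{
  v4hi_t v, s = {tt, tt, tt, tt};  /* movd + pshuflw $0 broadcast */
  memcpy(&v, t, sizeof v);         /* movq (%rdi), %xmm0 */
  v *= s;                          /* pmullw */
  memcpy(t, &v, sizeof v);         /* movq %xmm0, (%rdi) */
}
```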
Note costing even favors the "wrong" vectorization, but only slightly:

```
t.c:3:8: note: Cost model analysis:
_5 1 times scalar_store costs 12 in body
_9 1 times scalar_store costs 12 in body
_13 1 times scalar_store costs 12 in body
_17 1 times scalar_store costs 12 in body
_2 * tt.0_3 1 times scalar_stmt costs 16 in body
tt.0_3 * _7 1 times scalar_stmt costs 16 in body
tt.0_3 * _11 1 times scalar_stmt costs 16 in body
tt.0_3 * _15 1 times scalar_stmt costs 16 in body
node 0x59e9970 1 times vec_construct costs 18 in prologue
node 0x59e9a90 1 times vec_construct costs 18 in prologue
_2 * tt.0_3 1 times vector_stmt costs 16 in body
_5 1 times unaligned_store (misalign -1) costs 12 in body
t.c:3:8: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 112
t.c:3:8: note: Basic block will be vectorized using SLP
t.c:3:8: optimized: basic block part vectorized using 8 byte vectors
```

It seems the very bad code generation is mostly from constructing the V4HImode vectors going via GPRs with shifts and ORs. Possibly constructing a V4SImode vector and then packing to V4HImode would be better?
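Spelling out the totals from the dump (just re-adding the per-statement numbers above; even with the two vec_construct nodes, this vectorization still looks much cheaper than scalar, which is why it is taken):

```c
/* Per-statement costs copied from the cost-model dump above.  */
static int scalar_cost(void)
{
  return 4 * 12   /* four scalar_store */
       + 4 * 16;  /* four scalar_stmt multiplies */
}

static int vector_cost(void)
{
  return 2 * 18   /* two vec_construct in the prologue */
       + 16       /* one vector_stmt */
       + 12;      /* one unaligned_store */
}
```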
```
typedef unsigned short v4hi __attribute__((vector_size(8)));
typedef unsigned int v4si __attribute__((vector_size(16)));

v4hi foo (unsigned short a, unsigned short b, unsigned short c, unsigned short d)
{
  return (v4hi){a, b, c, d};
}

v4hi bar (unsigned short a, unsigned short b, unsigned short c, unsigned short d)
{
  return __builtin_convertvector ((v4si){a, b, c, d}, v4hi);
}
```

maybe not:

```
foo:
.LFB0:
        .cfi_startproc
        movzwl  %cx, %ecx
        movzwl  %dx, %edx
        movzwl  %si, %esi
        movzwl  %di, %edi
        salq    $16, %rcx
        orq     %rdx, %rcx
        salq    $16, %rcx
        orq     %rsi, %rcx
        salq    $16, %rcx
        orq     %rdi, %rcx
        movq    %rcx, %xmm0
        ret
bar:
.LFB1:
        .cfi_startproc
        movzwl  %di, %eax
        movzwl  %si, %esi
        movzwl  %dx, %edx
        movzwl  %cx, %ecx
        movd    %eax, %xmm0
        movd    %edx, %xmm1
        movd    %ecx, %xmm3
        movd    %esi, %xmm4
        punpckldq       %xmm3, %xmm1
        pxor    %xmm2, %xmm2
        punpckldq       %xmm4, %xmm0
        punpcklqdq      %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpckhwd       %xmm1, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        ret
```

though bar() looks like what I expected in .optimized:

```
  <bb 2> [local count: 1073741824]:
  _1 = (unsigned int) a_5(D);
  _2 = (unsigned int) b_6(D);
  _3 = (unsigned int) c_7(D);
  _4 = (unsigned int) d_8(D);
  _9 = {_1, _2, _3, _4};
  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;
  return _13;
```

It's a little bit better with SSE4:

```
bar:
.LFB1:
        .cfi_startproc
        movzwl  %di, %eax
        movzwl  %dx, %edx
        movzwl  %si, %esi
        movzwl  %cx, %ecx
        movd    %eax, %xmm1
        movd    %edx, %xmm0
        pinsrd  $1, %ecx, %xmm0
        pinsrd  $1, %esi, %xmm1
        punpcklqdq      %xmm0, %xmm1
        pxor    %xmm0, %xmm0
        pblendw $85, %xmm1, %xmm0
        pxor    %xmm1, %xmm1
        packusdw        %xmm1, %xmm0
        ret
```

but

```
        pxor    %xmm0, %xmm0
        pblendw $85, %xmm1, %xmm0
        pxor    %xmm1, %xmm1
        packusdw        %xmm1, %xmm0
```

is a bit odd for the packing. Possibly the target lacks a truncv4siv4hi operation (hence the explicit zero vector). Possibly x86 lacks a pack-lowpart/pack-highpart insn.
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs. Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

```
void v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}

void v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}
```

```
v4hi_contruct(short*, short, short):
        movzx   eax, dx
        movzx   esi, si
        mov     rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rsi
        mov     QWORD PTR [rdi], rdx
        ret
v4si_contruct(int*, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, esi
        vpinsrd xmm1, xmm2, edx, 1
        vpinsrd xmm0, xmm3, edx, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret
```

Both vmovd and vpinsrd are expensive, and v4hi_contruct is not necessarily worse than v4si_contruct, but v4hi_contruct could be made a little more parallel via GPRs.
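To illustrate that last point: the shift/or chain above is fully serial, but the same packing can be reassociated into a tree where the two halves are independent. A sketch (helper names are mine):

```c
#include <stdint.h>

/* Serial packing, mirroring the dependency chain GCC emits:
   every shift/or waits on the previous one.  */
static uint64_t pack_serial(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
  uint64_t v = d;
  v = (v << 16) | c;
  v = (v << 16) | b;
  v = (v << 16) | a;
  return v;
}

/* Tree-shaped packing: lo and hi have no dependency on each other,
   so they can issue in parallel on a wide core.  */
static uint64_t pack_tree(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
  uint64_t lo = (uint64_t)a | ((uint64_t)b << 16);
  uint64_t hi = (uint64_t)c | ((uint64_t)d << 16);
  return lo | (hi << 32);
}
```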
> is a bit odd for the packing. Possibly the target lacks a truncv4siv4hi
> operation (thus the explicit zero vector). Possibly x86 lacks a
> pack-lowpart/pack-highpart insn.

We do support truncv4siv4hi2, but only under AVX2; without AVX512 it generates pshufb:

```
(define_expand "trunc<mode><pmov_dst_4_lower>2"
  [(set (match_operand:<pmov_dst_4> 0 "register_operand")
    (truncate:<pmov_dst_4>
      (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))]
  "TARGET_AVX2"
{
```

```
bar(unsigned int __vector(4)):
        vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
```

Without AVX2, it's lowered to

```
  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;
```

and vec_pack_trunc_expr uses packusdw with the upper 16 bits cleared. The optab could be extended to TARGET_SSSE3, which supports pshufb.
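For clarity, the operation this optab implements is plain lane-wise modular truncation, with no saturation; that is why going through packusdw needs the upper 16 bits cleared first. A portable sketch via `__builtin_convertvector` (typedef names are mine):

```c
typedef unsigned int   v4si_t __attribute__((vector_size(16)));
typedef unsigned short v4hi_t __attribute__((vector_size(8)));

/* Each 32-bit lane is truncated to its low 16 bits (mod 2^16).  */
v4hi_t trunc_v4si_v4hi(v4si_t v)
{
  return __builtin_convertvector(v, v4hi_t);
}
```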
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.

On x86_64 that is true, but on aarch64 it is definitely not:

```
        fmov    s31, w1
        add     x1, x0, 2
        dup     v30.4h, v31.h[0]
        ld1     {v31.h}[1], [x1]
        add     x1, x0, 4
        ld1     {v31.h}[2], [x1]
        add     x1, x0, 6
        ld1     {v30.h}[0], [x0]
        ld1     {v31.h}[3], [x1]
        mul     v31.4h, v31.4h, v30.4h
        str     d31, [x0]
```

That is reasonable for constructing `{t[0], tt, tt, tt}` (one dup followed by one ld1) and `{tt, t[1], t[2], t[3]}` (one fmov followed by three ld1s). But the whole thing could have been, and should have been, just:

```
        ldr     d31, [x0]
        fmov    s7, w1
        mul     v31.4h, v31.4h, v7.h[0]
        str     d31, [x0]
```

Note the dup is folded into the mul instruction here.