Bug 115833 - SLP of signed short multiply goes wrong
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 15.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2024-07-09 05:01 UTC by Andrew Pinski
Modified: 2025-03-10 06:26 UTC (History)
5 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-07-09 00:00:00


Description Andrew Pinski 2024-07-09 05:01:24 UTC
Take:
```
void v4hi_smul(signed short *t, signed short tt)
{
  t[0] *= tt;
  t[1] *= tt;
  t[2] *= tt;
  t[3] *= tt;
}


void v4hi_umul(unsigned short *t, unsigned short tt)
{
  t[0] *= tt;
  t[1] *= tt;
  t[2] *= tt;
  t[3] *= tt;
}
```

At `-O2 -fno-vect-cost-model` on x86_64, code generation for v4hi_smul is really bad while v4hi_umul's is decent.

Both functions should produce the same code, but currently they don't.

I noticed this while adding V2HI support to the aarch64 backend and writing up testcases.
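For reference, the intended operation can be written directly with GCC's generic vector extensions (as Comment 2 below does for construction), which bypasses SLP discovery entirely; this is a sketch, and the function name is made up:

```c
#include <string.h>

typedef short v4hi __attribute__((vector_size (8)));

/* Hypothetical reference version: one 64-bit load, a vector-by-scalar
   multiply (the vector extension broadcasts the scalar), and one
   64-bit store -- roughly what both testcases should compile to.  */
void v4hi_smul_ref (short *t, short tt)
{
  v4hi v;
  memcpy (&v, t, sizeof v);
  v = v * tt;
  memcpy (t, &v, sizeof v);
}
```

With this form GCC has no operand-order ambiguity to resolve, so the signed and unsigned variants necessarily produce the same code.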
Comment 1 Richard Biener 2024-07-09 08:07:58 UTC
It's because we discover

t.c:3:8: note:   node 0x6624850 (max_nunits=4, refcnt=2) vector(4) unsigned short
t.c:3:8: note:   op template: _4 = _2 * tt.0_3;
t.c:3:8: note:          stmt 0 _4 = _2 * tt.0_3;
t.c:3:8: note:          stmt 1 _8 = tt.0_3 * _7;
t.c:3:8: note:          stmt 2 _12 = tt.0_3 * _11;
t.c:3:8: note:          stmt 3 _16 = tt.0_3 * _15;
t.c:3:8: note:          children 0x66248e0 0x6624a00
t.c:3:8: note:   node 0x66248e0 (max_nunits=4, refcnt=2) vector(4) unsigned short

and the comment already says

          if (is_a <bb_vec_info> (vinfo)
              && !oprnd_info->any_pattern)
            {
              /* Now for commutative ops we should see whether we can
                 make the other operand matching.  */
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "treating operand as external\n");
              oprnd_info->first_dt = dt = vect_external_def;
            }
          else
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "Build SLP failed: different types\n");
              return 1;
            }

Going the loop-SLP way in this case fixes it, but we should probably pass down
a hint whether the parent operation could swap (though we don't know whether
that would then succeed).

This then produces

v4hi_smul:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %xmm0
        movd    %esi, %xmm2
        pshuflw $0, %xmm2, %xmm1
        pmullw  %xmm1, %xmm0
        movq    %xmm0, (%rdi)
        ret

Note this is all heuristics and shows the difficulty of a greedy search
(without exploring the full search space).

Note costing even favors the "wrong" vectorization, but only slightly:

t.c:3:8: note: Cost model analysis:
_5 1 times scalar_store costs 12 in body
_9 1 times scalar_store costs 12 in body
_13 1 times scalar_store costs 12 in body
_17 1 times scalar_store costs 12 in body
_2 * tt.0_3 1 times scalar_stmt costs 16 in body
tt.0_3 * _7 1 times scalar_stmt costs 16 in body
tt.0_3 * _11 1 times scalar_stmt costs 16 in body
tt.0_3 * _15 1 times scalar_stmt costs 16 in body
node 0x59e9970 1 times vec_construct costs 18 in prologue
node 0x59e9a90 1 times vec_construct costs 18 in prologue
_2 * tt.0_3 1 times vector_stmt costs 16 in body
_5 1 times unaligned_store (misalign -1) costs 12 in body
t.c:3:8: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 112
t.c:3:8: note: Basic block will be vectorized using SLP
t.c:3:8: optimized: basic block part vectorized using 8 byte vectors

It seems the very bad code generation is mostly from constructing the
V4HImode vectors going via GPRs with shifts and ORs.  Possibly
constructing a V4SImode vector and then packing to V4HImode would be
better?
Comment 2 Richard Biener 2024-07-09 08:17:21 UTC
typedef unsigned short v4hi __attribute__((vector_size(8)));
typedef unsigned int v4si __attribute__((vector_size(16)));

v4hi foo (unsigned short a, unsigned short b, unsigned short c, unsigned short d)
{
  return (v4hi){a, b, c, d};
}

v4hi bar (unsigned short a, unsigned short b, unsigned short c, unsigned short d)
{
  return __builtin_convertvector ((v4si){a, b, c, d}, v4hi);
}

maybe not:

foo:
.LFB0:
        .cfi_startproc
        movzwl  %cx, %ecx
        movzwl  %dx, %edx
        movzwl  %si, %esi
        movzwl  %di, %edi
        salq    $16, %rcx
        orq     %rdx, %rcx
        salq    $16, %rcx
        orq     %rsi, %rcx
        salq    $16, %rcx
        orq     %rdi, %rcx
        movq    %rcx, %xmm0
        ret

bar:
.LFB1:
        .cfi_startproc
        movzwl  %di, %eax
        movzwl  %si, %esi
        movzwl  %dx, %edx
        movzwl  %cx, %ecx
        movd    %eax, %xmm0
        movd    %edx, %xmm1
        movd    %ecx, %xmm3
        movd    %esi, %xmm4
        punpckldq       %xmm3, %xmm1
        pxor    %xmm2, %xmm2
        punpckldq       %xmm4, %xmm0
        punpcklqdq      %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpckhwd       %xmm1, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        ret

though bar() looks as I expected in .optimized:

  <bb 2> [local count: 1073741824]:
  _1 = (unsigned int) a_5(D);
  _2 = (unsigned int) b_6(D);
  _3 = (unsigned int) c_7(D);
  _4 = (unsigned int) d_8(D);
  _9 = {_1, _2, _3, _4};
  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;
  return _13;

it's a little bit better with SSE4:

bar:
.LFB1:
        .cfi_startproc
        movzwl  %di, %eax
        movzwl  %dx, %edx
        movzwl  %si, %esi
        movzwl  %cx, %ecx
        movd    %eax, %xmm1
        movd    %edx, %xmm0
        pinsrd  $1, %ecx, %xmm0
        pinsrd  $1, %esi, %xmm1
        punpcklqdq      %xmm0, %xmm1
        pxor    %xmm0, %xmm0
        pblendw $85, %xmm1, %xmm0
        pxor    %xmm1, %xmm1
        packusdw        %xmm1, %xmm0
        ret

but

        pxor    %xmm0, %xmm0
        pblendw $85, %xmm1, %xmm0
        pxor    %xmm1, %xmm1
        packusdw        %xmm1, %xmm0

is a bit odd for the packing.  Possibly the target lacks a truncv4siv4hi
operation (thus the explicit zero vector).  Possibly x86 lacks a
pack-lowpart/pack-highpart insn.
Comment 3 Hongtao Liu 2024-07-09 08:46:58 UTC
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.  Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

void v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}


void v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}

v4hi_contruct(short*, short, short):
        movzx   eax, dx
        movzx   esi, si
        mov     rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rsi
        mov     QWORD PTR [rdi], rdx
        ret
v4si_contruct(int*, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, esi
        vpinsrd xmm1, xmm2, edx, 1
        vpinsrd xmm0, xmm3, edx, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret

Both vmovd and vpinsrd are expensive, and v4hi_contruct is not necessarily worse than v4si_contruct, but v4hi_contruct can be optimized to be a little more parallel via GPRs.
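A sketch of the "more parallel via GPRs" idea: instead of the serial sal/or chain above (each step depending on the previous one), pair up the halfwords independently and combine the pairs once, shortening the dependency chain. The function name and scalar return value are illustrative, not GCC output:

```c
#include <stdint.h>

/* Build the 64-bit V4HImode bit pattern {a, b, c, d} (little-endian
   element order) with two independent halfword pairs plus one final
   combine, instead of three dependent shift+or steps.  */
uint64_t v4hi_pack_parallel (uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
  uint64_t lo = (uint64_t) a | ((uint64_t) b << 16);  /* independent */
  uint64_t hi = (uint64_t) c | ((uint64_t) d << 16);  /* independent */
  return lo | (hi << 32);                             /* one combine */
}
```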
Comment 4 Hongtao Liu 2024-07-09 09:03:01 UTC
> is a bit odd for the packing.  Possibly the target lacks a truncv4siv4hi
> operation (thus the explicit zero vector).  Possibly x86 lacks a
> pack-lowpart/pack-highpart insn.

We support truncv4siv4hi2 under AVX2; w/o AVX512 it generates pshufb.

(define_expand "trunc<mode><pmov_dst_4_lower>2"
  [(set (match_operand:<pmov_dst_4> 0 "register_operand")
        (truncate:<pmov_dst_4>
          (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))]
  "TARGET_AVX2"
{


bar(unsigned int __vector(4)):
        vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret

w/o AVX2, it's lowered to

  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;

vec_pack_trunc_expr uses packusdw with the upper 16 bits cleared.

The optab can be extended to TARGET_SSSE3 which supports pshufb.
Comment 5 Andrew Pinski 2024-07-09 18:16:48 UTC
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.

On x86_64 that is true, but on aarch64 it is definitely not:

        fmov    s31, w1
        add     x1, x0, 2
        dup     v30.4h, v31.h[0]
        ld1     {v31.h}[1], [x1]
        add     x1, x0, 4
        ld1     {v31.h}[2], [x1]
        add     x1, x0, 6
        ld1     {v30.h}[0], [x0]
        ld1     {v31.h}[3], [x1]
        mul     v31.4h, v31.4h, v30.4h
        str     d31, [x0]

That is reasonable for `{t[0], tt, tt, tt}` (one dup followed by one ld1) and `{tt, t[1], t[2], t[3]}` (one fmov followed by 3 ld1s) vector construction. But the whole thing could and should have been just:
```
        ldr     d31, [x0]
        fmov    s7, w1
        mul     v31.4h, v31.4h, v7.h[0]
        str     d31, [x0]
```

Note the dup is part of the mul instruction here.