Bug 101668

Summary: BB vectorizer doesn't handle lowpart of existing vector
Product: gcc
Reporter: Hongtao.liu <crazylht>
Component: tree-optimization
Assignee: Richard Biener <rguenth>
Status: RESOLVED FIXED
Severity: normal
CC: rguenth
Priority: P3
Keywords: missed-optimization
Version: 12.0
Target Milestone: ---
Last reconfirmed: 2021-08-28 00:00:00
Bug Blocks: 53947
Attachments: patch

Description Hongtao.liu 2021-07-29 01:47:47 UTC
cat test.c

typedef int v16si __attribute__((vector_size (64)));
typedef long long v8di __attribute__((vector_size (64)));

void
bar_s32_s64 (v8di * dst, v16si src)
{
  long long tem[8];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  tem[4] = src[4];
  tem[5] = src[5];
  tem[6] = src[6];
  tem[7] = src[7];
  dst[0] = *(v8di *) tem;
}

gcc -O3 -march=skylake-avx512 fails to vectorize this case after my r12-2549 because I've increased the vec_construct cost for SKX/CLX. Here's the slp2 dump:

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <src_18(D), 32, 0>;
  _2 = (long long int) _1;
  _3 = BIT_FIELD_REF <src_18(D), 32, 32>;
  _4 = (long long int) _3;
  _5 = BIT_FIELD_REF <src_18(D), 32, 64>;
  _6 = (long long int) _5;
  _7 = BIT_FIELD_REF <src_18(D), 32, 96>;
  _8 = (long long int) _7;
  _9 = BIT_FIELD_REF <src_18(D), 32, 128>;
  _10 = (long long int) _9;
  _11 = BIT_FIELD_REF <src_18(D), 32, 160>;
  _12 = (long long int) _11;
  _13 = BIT_FIELD_REF <src_18(D), 32, 192>;
  _14 = (long long int) _13;
  _15 = BIT_FIELD_REF <src_18(D), 32, 224>;
  _31 = {_1, _3, _5, _7, _9, _11, _13, _15};
  vect__2.4_32 = (vector(8) long long int) _31;
  _16 = (long long int) _15;
  MEM <vector(8) long long int> [(long long int *)&tem] = vect__2.4_32;
  _17 = MEM[(v8di *)&tem];
  *dst_28(D) = _17;
  tem ={v} {CLOBBER};
  return;

But actually there's no need for a vec_construct from each element; it will be optimized to

   <bb 2> [local count: 1073741824]:
  _2 = BIT_FIELD_REF <src_18(D), 256, 0>;
  vect__2.4_32 = (vector(8) long long int) _2;
  *dst_28(D) = vect__2.4_32;
  return;

So once slp2 can recognize this optimization and cost the vec_construct more accurately, we can avoid this regression.
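
For reference, a hand-written C sketch of that optimized shape (a hypothetical bar_s32_s64_by_hand, using __builtin_convertvector and GCC 12's __builtin_shufflevector; the v8si typedef is illustrative and not part of the testcase):

typedef int v8si __attribute__((vector_size (32)));

void
bar_s32_s64_by_hand (v8di *dst, v16si src)
{
  /* Take the low 256 bits (eight ints) of the 512-bit source ...  */
  v8si lo = __builtin_shufflevector (src, src, 0, 1, 2, 3, 4, 5, 6, 7);
  /* ... and sign-extend each lane to 64 bits; with AVX-512 and
     -mprefer-vector-width=512 this can be a single vpmovsxdq.  */
  *dst = __builtin_convertvector (lo, v8di);
}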
Comment 1 Richard Biener 2021-07-29 06:55:56 UTC
The basic-block vectorizer is currently limited as to what "existing" vectors it recognizes.  In this testcase we're accessing only the lowpart of 'src',
something we cannot yet model in vectorizable_slp_permutation.  The specific
case isn't hard to fix, we'd get

  <bb 2> [local count: 1073741824]:
  _31 = VIEW_CONVERT_EXPR<vector(8) int>(src_18(D));
  vect__2.4_33 = [vec_unpack_lo_expr] _31;
  vect__2.4_34 = [vec_unpack_hi_expr] _31;
  MEM <vector(4) long long int> [(long long int *)&tem] = vect__2.4_33;
  MEM <vector(4) long long int> [(long long int *)&tem + 32B] = vect__2.4_34;
  _17 = MEM[(v8di *)&tem];
  *dst_28(D) = _17;
  tem ={v} {CLOBBER};
  return;

so we then fail to elide the temporary, producing

bar_s32_s64:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        vpmovsxdq       %xmm0, %ymm1
        vextracti128    $0x1, %ymm0, %xmm0
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-64, %rsp
        subq    $8, %rsp
        vpmovsxdq       %xmm0, %ymm0
        vmovdqa %ymm1, -56(%rsp)
        vmovdqa %ymm0, -24(%rsp)
        vmovdqa64       -56(%rsp), %zmm2
        vmovdqa64       %zmm2, (%rdi)
        leave
        .cfi_def_cfa 7, 8
        ret

It looks like there's no V8SI->V8DI conversion optab, or we choose V4DI
as the preferred vector mode for some other reason.
Comment 2 Hongtao.liu 2021-07-29 07:03:40 UTC
> It looks like there's no V8SI->V8DI conversion optab, or we choose V4DI
> as the preferred vector mode for some other reason.

We have one; we just need to add -mprefer-vector-width=512, then we'll get

bar_s32_s64:
  vpmovsxdq zmm0, ymm0
  vmovdqa64 ZMMWORD PTR [rdi], zmm0
  ret
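
That is, with the full set of flags from comment #0 plus the one added here (test.c as in comment #0):

gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S test.c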
Comment 3 Richard Biener 2022-05-20 09:03:08 UTC
Some pending enhancements would allow us to use VEC_PERM_EXPR with an input mode that differs from the output mode, and thus make implementing this easier.
vectorizable_slp_permutation doesn't yet support that though.

For the special case of a contiguous permutation we can also vectorize it
as BIT_FIELD_REF of course.
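
To illustrate the contiguous case, a small hand-written sketch (the v4si typedef and function name are hypothetical): lanes 4..7 of the v16si form one contiguous 128-bit chunk, which is exactly what a BIT_FIELD_REF <src, 128, 128> would express.

typedef int v4si __attribute__((vector_size (16)));

v4si
contiguous_lanes (v16si src)
{
  /* Adjacent lanes starting at lane 4 - a contiguous extract that can
     be a single BIT_FIELD_REF instead of a generic permute.  */
  return __builtin_shufflevector (src, src, 4, 5, 6, 7);
}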
Comment 4 Hongtao.liu 2022-05-20 09:13:36 UTC
Guess we need to extend the backend hook to handle different input and output modes.
Comment 5 Richard Biener 2022-05-20 09:25:24 UTC
(In reply to Hongtao.liu from comment #4)
> Guess we need to extend the backend hook to handle different input and
> output modes.

Yes; alternatively, as said, some special cases could be handled directly.
For example v16si -> v8si could be handled by VEC_PERM <lowpart, highpart, {..}>
without any extra magic (though IIRC we don't have a way to query target support
for the specific BIT_FIELD_REFs we'd use for getting at the lowpart or
highpart, and if those aren't available they would fall back to memory).
And contiguous permutes could be directly emitted as BIT_FIELD_REFs
(in some cases).
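
As a concrete sketch of such a v16si -> v8si permute, here is a hypothetical even_lanes written with GCC 12's __builtin_shufflevector (the v8si typedef is illustrative); the last statement mirrors the VEC_PERM <lowpart, highpart, {..}> shape described above:

typedef int v8si __attribute__((vector_size (32)));

v8si
even_lanes (v16si src)
{
  /* Split the 512-bit input into its 256-bit low and high halves ...  */
  v8si lo = __builtin_shufflevector (src, src, 0, 1, 2, 3, 4, 5, 6, 7);
  v8si hi = __builtin_shufflevector (src, src, 8, 9, 10, 11, 12, 13, 14, 15);
  /* ... and select the even lanes from their concatenation.  */
  return __builtin_shufflevector (lo, hi, 0, 2, 4, 6, 8, 10, 12, 14);
}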

I have a half-way patch that does the preparatory work but leaves
vectorizable_slp_permutation unchanged so we immediately fail there
due to

  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
    {
      if (!vect_maybe_update_slp_op_vectype (child, vectype)
          || !types_compatible_p (SLP_TREE_VECTYPE (child), vectype))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "Unsupported lane permutation\n");
          return false;

the comment above that says

  /* ???  We currently only support all same vector input and output types
     while the SLP IL should really do a concat + select and thus accept
     arbitrary mismatches.  */

so it was designed to handle more; it just wasn't necessary to implement it yet ...
Comment 6 Richard Biener 2022-05-25 13:05:01 UTC
Created attachment 53031 [details]
patch

This works now.  The support for the enhanced vec_perm_const is still not complete on trunk (it claims everything is OK ...), so it will ICE on testcases that require it.  But lowpart extracts and concats (untested) should work.
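
For the concat case, a hypothetical sketch (the v4di typedef and function name are illustrative; v8di is the typedef from comment #0):

typedef long long v4di __attribute__((vector_size (32)));

v8di
concat_halves (v4di a, v4di b)
{
  /* Indices 0..3 select from a, 4..7 from b - a plain concatenation of
     two 256-bit vectors into one 512-bit vector, no lane shuffle needed.  */
  return __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7);
}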

I'll extend coverage once the dependences are on trunk.  For the testcase at hand
we now generate

bar_s32_s64:
.LFB0:
        .cfi_startproc
        vpmovsxdq       %ymm0, %zmm0
        vmovdqa64       %zmm0, (%rdi)
        ret

with AVX512.
Comment 7 GCC Commits 2022-06-02 06:46:21 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:08afab6f8642f58f702010ec196dce3b00955627

commit r13-926-g08afab6f8642f58f702010ec196dce3b00955627
Author: Richard Biener <rguenther@suse.de>
Date:   Tue May 31 09:37:05 2022 +0200

    tree-optimization/101668 - relax SLP of existing vectors
    
    This relaxes the conditions on SLPing extracts from existing vectors
    leveraging the relaxed VEC_PERM conditions on the input vs output
    vector type compatibility.  It also handles lowpart extracts
    and concats without VEC_PERMs now.
    
    2022-05-25  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/101668
            * tree-vect-slp.cc (vect_build_slp_tree_1): Allow BIT_FIELD_REFs
            for vector types with compatible lane types.
            (vect_build_slp_tree_2): Deal with this.
            (vect_add_slp_permutation): Adjust.  Emit lowpart/concat
            special cases without VEC_PERM.
            (vectorizable_slp_permutation): Select the operand vector
            type and relax requirements.  Handle identity permutes
            with mismatching operand types.
            * optabs-query.cc (can_vec_perm_const_p): Only allow variable
            permutes for op_mode == mode.
    
            * gcc.target/i386/pr101668.c: New testcase.
            * gcc.dg/vect/bb-slp-pr101668.c: Likewise.
Comment 8 Richard Biener 2022-06-02 06:47:20 UTC
Fixed (lowpart of existing vector).  The vectorizer should now also handle
other things like extract-even, but the target needs to support this in its
vec_perm_const handling, which now allows different input/output modes.
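
A hypothetical testcase for that extract-even case, in the style of the one from comment #0 (reusing the v16si typedef from there; the v8si typedef and function name are illustrative):

typedef int v8si __attribute__((vector_size (32)));

void
bar_even_s32 (v8si *dst, v16si src)
{
  int tem[8];
  tem[0] = src[0];
  tem[1] = src[2];
  tem[2] = src[4];
  tem[3] = src[6];
  tem[4] = src[8];
  tem[5] = src[10];
  tem[6] = src[12];
  tem[7] = src[14];
  dst[0] = *(v8si *) tem;
}

Whether this ends up as a single cross-lane permute depends on the target's vec_perm_const support, as noted above.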
vec_perm_const handling which now allows different input/output modes.