Bug 111697 - Sub optimal code gen for initialising vector using loop
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2023-10-04 19:25 UTC by prathamesh3492
Modified: 2024-03-28 11:26 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-10-04 00:00:00


Description prathamesh3492 2023-10-04 19:25:36 UTC
Hi,
For the following test-case:

typedef int v4si __attribute__((vector_size (sizeof (int) * 4)));
v4si f(int x)
{
  v4si v;
  for (int i = 0; i < 4; i++)
    v[i] = x;
  return v;
}

Compiling with -O2 results in the following .optimized dump:

v4si f (int x)
{
  v4si v;

  <bb 2> [local count: 214748368]:
  v_16 = BIT_INSERT_EXPR <v_12(D), x_6(D), 0 (32 bits)>;
  v_20 = BIT_INSERT_EXPR <v_16, x_6(D), 32 (32 bits)>;
  v_24 = BIT_INSERT_EXPR <v_20, x_6(D), 64 (32 bits)>;
  v_2 = BIT_INSERT_EXPR <v_24, x_6(D), 96 (32 bits)>;
  return v_2;

}

and the following code-gen on aarch64:
f:
        movi    v0.4s, 0
        fmov    s31, w0
        ins     v0.s[0], v31.s[0]
        ins     v0.s[1], v31.s[0]
        ins     v0.s[2], v31.s[0]
        ins     v0.s[3], v31.s[0]
        ret

which could instead be a single dup instruction:
f:
        dup     v0.4s, w0
        ret
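
For comparison, the same splat can be written as a single vector constructor at the source level, which sidesteps the per-lane inserts. This is a minimal sketch assuming GNU C vector extensions (GCC or Clang); the name f_splat is hypothetical, and the exact instructions emitted depend on compiler version and target:

```c
typedef int v4si __attribute__((vector_size (sizeof (int) * 4)));

/* Same result as f(), but expressed as one vector constructor
   instead of four element stores.  f_splat is an illustrative name.  */
v4si f_splat (int x)
{
  return (v4si) { x, x, x, x };
}
```

The loop in f() and this constructor compute the same value, so the optimizer would ideally reduce both to the same form.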

Similarly, code-gen on x86_64:
f:
        movd    %edi, %xmm0
        movd    %edi, %xmm1
        pshufd  $225, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $225, %xmm0, %xmm0
        pshufd  $198, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $198, %xmm0, %xmm0
        pshufd  $39, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $39, %xmm0, %xmm0
        ret
Comment 1 Andrew Pinski 2023-10-04 19:31:53 UTC
Confirmed. PR 58497 is basically the same issue in the end. I had patches for this, but I was not 100% sure they handled it in a decent location.
Comment 2 Richard Biener 2023-10-05 07:46:48 UTC
We have quite a bit of code handling vector CTORs in tree-ssa-forwprop.cc, and this should be optimized to

 v_2 = { x_6(D), x_6(D), x_6(D), x_6(D) };

Note that SLP vectorization can do this, but it fails here because it doesn't
handle a default def insert: it handles a group of BIT_INSERT_EXPRs as a
vector CTOR, but SLP discovery doesn't know how to start from external defs
(it needs actual definition stmts).

A more general approach would be to try to track vector construction through
symbolic execution like we form bswap in the bswap pass.
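
The symbolic-execution idea can be illustrated with a toy model, loosely analogous to how the bswap pass tracks byte provenance. This is only a sketch and not GCC code: struct insert, struct lanes, track_inserts and is_splat are hypothetical names, scalars are represented by opaque integer ids, and the vector is assumed to have four 32-bit lanes.

```c
#include <stdbool.h>

#define NLANES 4
#define LANE_BITS 32

/* Hypothetical stand-in for BIT_INSERT_EXPR <vec, val, bitpos>;
   val is an opaque id naming the inserted scalar (e.g. 6 for x_6).  */
struct insert { int val; unsigned bitpos; };

/* Symbolic state: the scalar id each lane currently holds,
   or -1 while the lane is still undefined.  */
struct lanes { int src[NLANES]; };

/* Symbolically execute N inserts in program order, starting from an
   undefined vector.  Returns false if an insert is not lane-aligned
   or falls outside the vector.  */
bool track_inserts (const struct insert *ins, int n, struct lanes *out)
{
  for (int i = 0; i < NLANES; i++)
    out->src[i] = -1;
  for (int i = 0; i < n; i++)
    {
      unsigned lane = ins[i].bitpos / LANE_BITS;
      if (ins[i].bitpos % LANE_BITS != 0 || lane >= NLANES)
        return false;
      out->src[lane] = ins[i].val;  /* a later insert overwrites the lane */
    }
  return true;
}

/* True if every lane is defined and all lanes hold the same scalar,
   i.e. the tracked vector is a splat.  */
bool is_splat (const struct lanes *l)
{
  if (l->src[0] < 0)
    return false;
  for (int i = 1; i < NLANES; i++)
    if (l->src[i] != l->src[0])
      return false;
  return true;
}
```

For the dump in the description, the four inserts of x_6 at bit positions 0, 32, 64 and 96 leave every lane holding the same id, so the final state describes a splat even though the chain starts from an undefined vector.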
Comment 3 Richard Biener 2023-10-05 07:48:10 UTC
(In reply to Richard Biener from comment #2)
> We have quite some code doing vector CTOR stuff in tree-ssa-forwprop.cc and
> this should be optimized to
> 
>  v_2 = { x_6(D), x_6(D), x_6(D), x_6(D) };
> 
> note SLP vectorization can do this but it fails because it doesn't handle
> a default def insert - it handles a group of BIT_INSERT_EXPRs as
> vector CTOR and SLP discovery doesn't know how to start from external defs
> (it needs actual definition stmts).
> 
> A more general approach would be to try to track vector construction through
> symbolic execution like we form bswap in the bswap pass.

You could "steal" the code in vect_slp_check_for_roots,

      else if (code == BIT_INSERT_EXPR
               && VECTOR_TYPE_P (TREE_TYPE (rhs))
               && TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).is_constant ()
               && TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).to_constant () > 1
               && integer_zerop (gimple_assign_rhs3 (assign))
               && useless_type_conversion_p
                    (TREE_TYPE (TREE_TYPE (rhs)),
                     TREE_TYPE (gimple_assign_rhs2 (assign)))
               && bb_vinfo->lookup_def (gimple_assign_rhs2 (assign)))
        {
          /* We start to match on insert to lane zero but since the
             inserts need not be ordered we'd have to search both
             the def and the use chains.  */
...

and put it into tree-ssa-forwprop.cc, explicitly creating the vector CTOR.
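
As a rough illustration of collecting such a chain into a CTOR, the sketch below walks the def chain backwards from the last insert. It is a toy model rather than forwprop code: struct insert_stmt and collect_ctor are hypothetical names, scalars are opaque integer ids, and a NULL prev pointer stands for the default def (v_12(D) in the dump). A real implementation would use the GIMPLE statement accessors instead, and would also have to search the use chain when starting from the lane-zero insert.

```c
#include <stdbool.h>
#include <stddef.h>

#define NLANES 4
#define LANE_BITS 32

/* Hypothetical model of "res = BIT_INSERT_EXPR <prev, val, bitpos>".
   prev == NULL models an uninitialized default def.  */
struct insert_stmt
{
  const struct insert_stmt *prev;
  int val;
  unsigned bitpos;
};

/* Walk the def chain backwards from LAST and record the scalar id of
   each lane in CTOR.  The insert closest to LAST defines a lane; older
   inserts to the same lane are dead.  Returns true only if all lanes
   are written by the chain, so the default def contributes nothing.  */
bool collect_ctor (const struct insert_stmt *last, int ctor[NLANES])
{
  bool seen[NLANES] = { false };
  int nseen = 0;
  for (const struct insert_stmt *s = last; s != NULL; s = s->prev)
    {
      unsigned lane = s->bitpos / LANE_BITS;
      if (s->bitpos % LANE_BITS != 0 || lane >= NLANES)
        return false;
      if (!seen[lane])
        {
          seen[lane] = true;
          ctor[lane] = s->val;
          nseen++;
        }
    }
  return nseen == NLANES;
}
```

Because lanes are keyed by bit position, the inserts do not need to appear in lane order, which matches the concern raised in the quoted source comment.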