Bug 101929 - [13/14/15/16/17 Regression] r12-7319 regress x264_r by 4% on CLX.
Summary: [13/14/15/16/17 Regression] r12-7319 regress x264_r by 4% on CLX.
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 12.0
: P2 normal
Target Milestone: 13.5
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 98138
Blocks: spec
  Show dependency treegraph
 
Reported: 2021-08-16 05:08 UTC by Hongtao.liu
Modified: 2026-04-22 14:27 UTC (History)
5 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-02-22 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hongtao.liu 2021-08-16 05:08:55 UTC
The regression is in x264_pixel_satd_8x4

typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
typedef unsigned short uint16_t;

// in: a pseudo-simd number of the form x+(y<<16)
// return: abs(x)+(abs(y)<<16)
static inline
uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

int
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}

after increase cost of vector CTOR, slp1 won't vector for below
git diff my.slp1 original.slp1

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;
-  vect__192.68_825 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_824);
   t3_190 = (int) _189;
   _191 = t0_184 + t2_188;
   _192 = (unsigned int) _191;
+  tmp[0][0] = _192;
   _194 = t0_184 - t2_188;
   _195 = (unsigned int) _194;
+  tmp[0][2] = _195;
   _197 = t1_186 + t3_190;
   _198 = (unsigned int) _197;
+  tmp[0][1] = _198;
   _200 = t1_186 - t3_190;
   _201 = (unsigned int) _200;
-  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = vect__192.68_825;
+  tmp[0][3] = _201;

but the vectorized version can somehow help fre to eliminate redundant vector load and then got even better performace.

git diff dump.veclower21 dump.fre5

   MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B] = vect__54.89_852;
-  vect__63.9_482 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
-  vect__64.12_478 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
-  vect__65.13_477 = vect__63.9_482 + vect__64.12_478;
+  vect__65.13_477 = vect__192.68_825 + vect__273.75_834;
   vect_t0_100.14_476 = VIEW_CONVERT_EXPR<vector(4) int>(vect__65.13_477);
-  vect__67.15_475 = vect__63.9_482 - vect__64.12_478;
+  vect__67.15_475 = vect__192.68_825 - vect__273.75_834;
   vect_t1_101.16_474 = VIEW_CONVERT_EXPR<vector(4) int>(vect__67.15_475);
-  vect__68.19_470 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
-  vect__69.22_466 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];
-  vect__70.23_465 = vect__68.19_470 + vect__69.22_466;
+  vect__70.23_465 = vect__354.82_843 + vect__54.89_852;

If slp1 can realize this and add the upper part to comparison of scalar cost vs vector cost, gcc should do vectorization, but currently it doesn't.
Comment 1 Hongtao.liu 2021-08-16 05:10:36 UTC
Considering this, I'm debating whether to revert my patch.
Comment 2 Hongtao.liu 2021-08-16 05:21:05 UTC
W/o accurate info provided by vectorizer, the backend can do nothing about this regression except reverting the patch, that's why i marked the bugzilla ad tree-optimization component.
Comment 3 Richard Biener 2021-08-16 09:43:37 UTC
It's interesting to note that in

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;

we only need parts of the CTOR for the add/sub parts (because we ignore
some lanes with the blend).  That might even allow to elide the final
compose of the low/high part and expose some more insn parallelism.

Of course that looks quite difficult to achieve.

--

Note your CTOR cost estimates might be off given the CTORs are mostly
regular like

{ _181, _181, _181, _181, _262, _262, _262, _262, _343, _343, _343, _343, _48, _48, _48, _48 }

thus could use 4 splats to xmm and 4 inserts?  For the V4SI vectorization
we unfortunately decide to do

t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Building parent vector operands from scalars instead

and thus end up with { _49, _50, _49, _50 }.  That said, I don't think
the backend gets easy access to the actual CTOR layout yet to improve costing
(similar as with permutes and the actual permute mask).

--

It's difficult (if not impossible) for the vectorizer to second-guess
the followup FRE, we're a long way from doing loop + SLP vectorization
in one go and discover we can elide the vector store.
Comment 4 GCC Commits 2021-08-19 02:29:39 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:1db70e61a92978377a648bbd90e383859fc0126b

commit r12-3011-g1db70e61a92978377a648bbd90e383859fc0126b
Author: liuhongt <hongtao.liu@intel.com>
Date:   Tue Aug 17 17:29:06 2021 +0800

    Revert "Add the member integer_to_sse to processor_cost as a cost simulation for movd/pinsrd. It will be used to calculate the cost of vec_construct."
    
    This reverts commit 872da9a6f664a06d73c987aa0cb2e5b830158a10.
    
    PR target/101936
    PR target/101929
Comment 5 Richard Biener 2022-02-08 13:36:45 UTC
The rev. was reverted, see PR98138 for vectorization of this kernel.
Comment 6 Richard Biener 2022-02-22 08:01:57 UTC
Proactively re-opening.  I will see whether something can be done.
Comment 7 Richard Biener 2022-03-07 07:18:43 UTC
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 9188d727e33..7f1f12fb6c6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2374,7 +2375,7 @@ fail:
                n_vector_builds++;
            }
        }
-      if (all_uniform_p
+      if ((all_uniform_p && !two_operators)
          || n_vector_builds > 1
          || (n_vector_builds == children.length ()
              && is_a <gphi *> (stmt_info->stmt)))


will re-enable the vectorization - it evades the vect_construct cost bump
by instead using scalar_to_vec (aka splat) which has not yet been fixed to
account for a possible gpr to xmm move (so it would be a temporary "solution"
at best).

Another change to mute the effect somewhat (but not fixing x264) that was mentioned is

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..acf2cc977b4 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);

which makes sense as the cost of the initial value of the xmm destination is
now costed separately when from GPR and free when from xmm.
Comment 8 Hongtao.liu 2022-03-07 08:22:01 UTC
(In reply to Richard Biener from comment #7)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9188d727e33..7f1f12fb6c6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2374,7 +2375,7 @@ fail:
>                 n_vector_builds++;
>             }
>         }
> -      if (all_uniform_p
> +      if ((all_uniform_p && !two_operators)
>           || n_vector_builds > 1
>           || (n_vector_builds == children.length ()
>               && is_a <gphi *> (stmt_info->stmt)))
> 
> 
> will re-enable the vectorization - it evades the vect_construct cost bump
> by instead using scalar_to_vec (aka splat) which has not yet been fixed to
> account for a possible gpr to xmm move (so it would be a temporary "solution"
> at best).
> 
> Another change to mute the effect somewhat (but not fixing x264) that was
> mentioned is
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> ix86_cost->sse_op;

(In reply to Richard Biener from comment #7)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9188d727e33..7f1f12fb6c6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2374,7 +2375,7 @@ fail:
>                 n_vector_builds++;
>             }
>         }
> -      if (all_uniform_p
> +      if ((all_uniform_p && !two_operators)
>           || n_vector_builds > 1
>           || (n_vector_builds == children.length ()
>               && is_a <gphi *> (stmt_info->stmt)))
> 
> 
> will re-enable the vectorization - it evades the vect_construct cost bump
> by instead using scalar_to_vec (aka splat) which has not yet been fixed to
> account for a possible gpr to xmm move (so it would be a temporary "solution"
> at best).
> 
> Another change to mute the effect somewhat (but not fixing x264) that was
> mentioned is
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> ix86_cost->sse_op;
n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n - 2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one.
Comment 9 rguenther@suse.de 2022-03-07 09:03:54 UTC
On Mon, 7 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929
> 
> --- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #7)
> > Another change to mute the effect somewhat (but not fixing x264) that was
> > mentioned is
> > 
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index b2bf90576d5..acf2cc977b4 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> > vect_cost_for_stmt type_of_cost,
> >        case vec_construct:
> >         {
> >           /* N element inserts into SSE vectors.  */
> > -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> > ix86_cost->sse_op;
> n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n -
> 2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one.

True!  Note that without SLP the gpr->xmm move cost is not yet accounted
for (for loops the cases where we will need an actual gpr->xmm move
will be restricted to CTORs emitted in the prologue - in-loop cases
will always come from memory, so it might not be too important to get
that correct for the non-SLP case).
Comment 10 Richard Biener 2022-03-11 14:05:10 UTC
Should be fixed with r12-7612.
Comment 11 Hongtao.liu 2022-05-20 03:56:30 UTC
105493
Comment 12 Hongtao.liu 2022-05-20 04:05:01 UTC
 
> It's difficult (if not impossible) for the vectorizer to second-guess
> the followup FRE, we're a long way from doing loop + SLP vectorization
> in one go and discover we can elide the vector store.

I'm thinking of adding some detect in the vectorizer to find the "fre pair" of the new vectorized store and existed vector load, then eliminate vector_store cost in add_stmt_cost since it's probably be eliminated.

New vector store:  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = vect__192.68_825;

Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp]
Comment 13 Richard Biener 2022-05-20 06:17:00 UTC
(In reply to Hongtao.liu from comment #12)
>  
> > It's difficult (if not impossible) for the vectorizer to second-guess
> > the followup FRE, we're a long way from doing loop + SLP vectorization
> > in one go and discover we can elide the vector store.
> 
> I'm thinking of adding some detect in the vectorizer to find the "fre pair"
> of the new vectorized store and existed vector load, then eliminate
> vector_store cost in add_stmt_cost since it's probably be eliminated.
> 
> New vector store:  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] =
> vect__192.68_825;
> 
> Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int>
> [(unsigned int *)&tmp]

I think it would be more useful to explore whether we can find a special
kind of "SLP seed" by looking for vector loads that load from a store
group previously identified.  We can then have the vector _load_ be the
SLP seed similar as to how we handle vector CTORs.  The only special
sauce would be that we need to perform dependence analysis and make the
"store" cheaper (but only if it is really dead afterwards which would
mean doing some kind of DSE analysis - but of course we can offset
the vector load cost that goes away, maybe that alone helps).
Comment 14 Drea Pinski 2023-12-11 20:24:42 UTC
PR 98138  is exactly the same loop ...