The regression is in x264_pixel_satd_8x4 typedef unsigned char uint8_t; typedef unsigned int uint32_t; typedef unsigned short uint16_t; // in: a pseudo-simd number of the form x+(y<<16) // return: abs(x)+(abs(y)<<16) static inline uint32_t abs2( uint32_t a ) { uint32_t s = ((a>>15)&0x10001)*0xffff; return (a+s)^s; } #define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ int t0 = s0 + s1;\ int t1 = s0 - s1;\ int t2 = s2 + s3;\ int t3 = s2 - s3;\ d0 = t0 + t2;\ d2 = t0 - t2;\ d1 = t1 + t3;\ d3 = t1 - t3;\ } int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) { uint32_t tmp[4][4]; uint32_t a0, a1, a2, a3; int sum = 0; for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) { a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); } for( int i = 0; i < 4; i++ ) { HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] ); sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3); } return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1; } after increase cost of vector CTOR, slp1 won't vector for below git diff my.slp1 original.slp1 - _820 = {_187, _189, _187, _189}; - vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820); - vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821; - vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821; - _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>; - vect__192.68_825 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_824); t3_190 = (int) _189; _191 = t0_184 + t2_188; _192 = (unsigned int) _191; + tmp[0][0] = _192; _194 = t0_184 - t2_188; _195 = (unsigned int) _194; + tmp[0][2] = _195; _197 = t1_186 + t3_190; _198 = (unsigned int) _197; + tmp[0][1] = _198; _200 = t1_186 - t3_190; _201 = (unsigned int) _200; - MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = vect__192.68_825; + tmp[0][3] = _201; but the vectorized version can somehow help fre to eliminate redundant vector load and then got even better performace. git diff dump.veclower21 dump.fre5 MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B] = vect__54.89_852; - vect__63.9_482 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp]; - vect__64.12_478 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B]; - vect__65.13_477 = vect__63.9_482 + vect__64.12_478; + vect__65.13_477 = vect__192.68_825 + vect__273.75_834; vect_t0_100.14_476 = VIEW_CONVERT_EXPR<vector(4) int>(vect__65.13_477); - vect__67.15_475 = vect__63.9_482 - vect__64.12_478; + vect__67.15_475 = vect__192.68_825 - vect__273.75_834; vect_t1_101.16_474 = VIEW_CONVERT_EXPR<vector(4) int>(vect__67.15_475); - vect__68.19_470 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B]; - vect__69.22_466 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B]; - vect__70.23_465 = vect__68.19_470 + vect__69.22_466; + vect__70.23_465 = vect__354.82_843 + vect__54.89_852; If slp1 can realize this and add the upper part to comparison of scalar cost vs vector cost, gcc should do vectorization, but currently it doesn't.
Considering this, I'm debating whether to revert my patch.
W/o accurate info provided by vectorizer, the backend can do nothing about this regression except reverting the patch, that's why i marked the bugzilla ad tree-optimization component.
It's interesting to note that in - _820 = {_187, _189, _187, _189}; - vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820); - vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821; - vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821; - _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>; we only need parts of the CTOR for the add/sub parts (because we ignore some lanes with the blend). That might even allow to elide the final compose of the low/high part and expose some more insn parallelism. Of course that looks quite difficult to achieve. -- Note your CTOR cost estimates might be off given the CTORs are mostly regular like { _181, _181, _181, _181, _262, _262, _262, _262, _343, _343, _343, _343, _48, _48, _48, _48 } thus could use 4 splats to xmm and 4 inserts? For the V4SI vectorization we unfortunately decide to do t.c:37:9: note: Using a splat of the uniform operand t.c:37:9: note: Using a splat of the uniform operand t.c:37:9: note: Building parent vector operands from scalars instead and thus end up with { _49, _50, _49, _50 }. That said, I don't think the backend gets easy access to the actual CTOR layout yet to improve costing (similar as with permutes and the actual permute mask). -- It's difficult (if not impossible) for the vectorizer to second-guess the followup FRE, we're a long way from doing loop + SLP vectorization in one go and discover we can elide the vector store.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:1db70e61a92978377a648bbd90e383859fc0126b commit r12-3011-g1db70e61a92978377a648bbd90e383859fc0126b Author: liuhongt <hongtao.liu@intel.com> Date: Tue Aug 17 17:29:06 2021 +0800 Revert "Add the member integer_to_sse to processor_cost as a cost simulation for movd/pinsrd. It will be used to calculate the cost of vec_construct." This reverts commit 872da9a6f664a06d73c987aa0cb2e5b830158a10. PR target/101936 PR target/101929
The rev. was reverted, see PR98138 for vectorization of this kernel.
Proactively re-opening. I will see whether something can be done.
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 9188d727e33..7f1f12fb6c6 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -2374,7 +2375,7 @@ fail: n_vector_builds++; } } - if (all_uniform_p + if ((all_uniform_p && !two_operators) || n_vector_builds > 1 || (n_vector_builds == children.length () && is_a <gphi *> (stmt_info->stmt))) will re-enable the vectorization - it evades the vect_construct cost bump by instead using scalar_to_vec (aka splat) which has not yet been fixed to account for a possible gpr to xmm move (so it would be a temporary "solution" at best). Another change to mute the effect somewhat (but not fixing x264) that was mentioned is diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b2bf90576d5..acf2cc977b4 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, case vec_construct: { /* N element inserts into SSE vectors. */ - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * ix86_cost->sse_op; /* One vinserti128 for combining two SSE vectors for AVX256. */ if (GET_MODE_BITSIZE (mode) == 256) cost += ix86_vec_cost (mode, ix86_cost->addss); which makes sense as the cost of the initial value of the xmm destination is now costed separately when from GPR and free when from xmm.
(In reply to Richard Biener from comment #7) > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 9188d727e33..7f1f12fb6c6 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -2374,7 +2375,7 @@ fail: > n_vector_builds++; > } > } > - if (all_uniform_p > + if ((all_uniform_p && !two_operators) > || n_vector_builds > 1 > || (n_vector_builds == children.length () > && is_a <gphi *> (stmt_info->stmt))) > > > will re-enable the vectorization - it evades the vect_construct cost bump > by instead using scalar_to_vec (aka splat) which has not yet been fixed to > account for a possible gpr to xmm move (so it would be a temporary "solution" > at best). > > Another change to mute the effect somewhat (but not fixing x264) that was > mentioned is > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index b2bf90576d5..acf2cc977b4 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum > vect_cost_for_stmt type_of_cost, > case vec_construct: > { > /* N element inserts into SSE vectors. */ > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * > ix86_cost->sse_op; (In reply to Richard Biener from comment #7) > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 9188d727e33..7f1f12fb6c6 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -2374,7 +2375,7 @@ fail: > n_vector_builds++; > } > } > - if (all_uniform_p > + if ((all_uniform_p && !two_operators) > || n_vector_builds > 1 > || (n_vector_builds == children.length () > && is_a <gphi *> (stmt_info->stmt))) > > > will re-enable the vectorization - it evades the vect_construct cost bump > by instead using scalar_to_vec (aka splat) which has not yet been fixed to > account for a possible gpr to xmm move (so it would be a temporary "solution" > at best). > > Another change to mute the effect somewhat (but not fixing x264) that was > mentioned is > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index b2bf90576d5..acf2cc977b4 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum > vect_cost_for_stmt type_of_cost, > case vec_construct: > { > /* N element inserts into SSE vectors. */ > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * > ix86_cost->sse_op; n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n - 2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one.
On Mon, 7 Mar 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929 > > --- Comment #8 from Hongtao.liu <crazylht at gmail dot com> --- > (In reply to Richard Biener from comment #7) > > Another change to mute the effect somewhat (but not fixing x264) that was > > mentioned is > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index b2bf90576d5..acf2cc977b4 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum > > vect_cost_for_stmt type_of_cost, > > case vec_construct: > > { > > /* N element inserts into SSE vectors. */ > > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > > + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * > > ix86_cost->sse_op; > n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n - > 2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one. True! Note that without SLP the gpr->xmm move cost is not yet accounted for (for loops the cases where we will need an actual gpr->xmm move will be restricted to CTORs emitted in the prologue - in-loop cases will always come from memory, so it might not be too important to get that correct for the non-SLP case).
Should be fixed with r12-7612.
105493
> It's difficult (if not impossible) for the vectorizer to second-guess > the followup FRE, we're a long way from doing loop + SLP vectorization > in one go and discover we can elide the vector store. I'm thinking of adding some detect in the vectorizer to find the "fre pair" of the new vectorized store and existed vector load, then eliminate vector_store cost in add_stmt_cost since it's probably be eliminated. New vector store: MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = vect__192.68_825; Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp]
(In reply to Hongtao.liu from comment #12) > > > It's difficult (if not impossible) for the vectorizer to second-guess > > the followup FRE, we're a long way from doing loop + SLP vectorization > > in one go and discover we can elide the vector store. > > I'm thinking of adding some detect in the vectorizer to find the "fre pair" > of the new vectorized store and existed vector load, then eliminate > vector_store cost in add_stmt_cost since it's probably be eliminated. > > New vector store: MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = > vect__192.68_825; > > Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int> > [(unsigned int *)&tmp] I think it would be more useful to explore whether we can find a special kind of "SLP seed" by looking for vector loads that load from a store group previously identified. We can then have the vector _load_ be the SLP seed similar as to how we handle vector CTORs. The only special sauce would be that we need to perform dependence analysis and make the "store" cheaper (but only if it is really dead afterwards which would mean doing some kind of DSE analysis - but of course we can offset the vector load cost that goes away, maybe that alone helps).
PR 98138 is exactly the same loop ...