Test case:

extern void test(unsigned int t[4][4]);

void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
    a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
    a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
    a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

    int t0 = a0 + a1;
    int t1 = a0 - a1;
    int t2 = a2 + a3;
    int t3 = a2 - a3;

    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
  test(tmp);
}

The expected code on ppc64le can look like:

// p1 byte 0 to byte 7
d1_0_7 = load_dword(p1)
// p1+i1 byte 0 to byte 7, renamed as bytes 8 to 15
d1_8_15 = load_dword(p1 + i1)
d1_16_23 = load_dword(p1 + 2*i1)
d1_24_31 = load_dword(p1 + 3*i1)

V_d1_0_15 = construct_vec(d1_0_7, d1_8_15)      // vector char
V_d1_16_31 = construct_vec(d1_16_23, d1_24_31)
V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
                     {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
                     {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

// Do the same for p2 with i2, getting V_d2_0_3_all, V_d2_4_7_all

// Do the subtraction together (all 4x4 bytes)
V_sub1 = V_d1_0_3_all - V_d2_0_3_all
V_sub2 = V_d1_4_7_all - V_d2_4_7_all

// Do some unpacking to get the promoted vector int
V_a0_tmp = vec_promote(V_sub2, {0 1 2 3})       // vector int {b4 b12 b20 b28}
V_a0_1 = V_a0_tmp << 16
V_a0_0 = vec_promote(V_sub1, {0 1 2 3})         // vector int {b0 b8 b16 b24}
// vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
V_a0 = V_a0_0 + V_a0_1

// Similarly get V_a1, V_a2, V_a3

// Compute t0/t1/t2/t3
// vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
V_t0 = V_a0 + V_a1
V_t1 = V_a0 - V_a1
V_t2 = V_a2 + V_a3
V_t3 = V_a2 - V_a3

// Compute the tmp columns
// vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
V_tmp0 = V_t0 + V_t2
V_tmp2 = V_t0 - V_t2
V_tmp1 = V_t1 + V_t3
V_tmp3 = V_t1 - V_t3

// Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
// with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
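For reference, the column-wise scheme above can also be written directly in C with GCC's generic vector extensions. This is only a rough sketch: the name foo_colwise, the vec_u32 typedef, the scalar byte-gathering loops and the final scalar transpose are illustrative stand-ins for the vector loads and vperms of the pseudocode, not what the vectorizer would actually emit.

typedef unsigned int vec_u32 __attribute__((vector_size(16)));

extern void test(unsigned int t[4][4]);

void foo_colwise(unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  vec_u32 a[4];                    /* a[j] = {a_j of iter 0, ..., a_j of iter 3} */

  /* Gather column j across the four iterations (vector loads + vperms above).  */
  for (int j = 0; j < 4; j++)
    for (int i = 0; i < 4; i++)
      {
        unsigned char *q1 = p1 + i * i1, *q2 = p2 + i * i2;
        a[j][i] = (q1[j] - q2[j]) + ((q1[j + 4] - q2[j + 4]) << 16);
      }

  /* One Hadamard stage over whole columns, i.e. all four iterations at once.  */
  vec_u32 t0 = a[0] + a[1], t1 = a[0] - a[1];
  vec_u32 t2 = a[2] + a[3], t3 = a[2] - a[3];
  vec_u32 c0 = t0 + t2, c2 = t0 - t2;          /* c_k = column k of tmp */
  vec_u32 c1 = t1 + t3, c3 = t1 - t3;

  /* Transpose the column vectors back into the rows of tmp.  */
  for (int i = 0; i < 4; i++)
    {
      tmp[i][0] = c0[i];
      tmp[i][1] = c1[i];
      tmp[i][2] = c2[i];
      tmp[i][3] = c3[i];
    }
  test(tmp);
}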
A similar case is x264_pixel_satd_8x4 in x264: https://github.com/mirror/x264/blob/4121277b40a667665d4eea1726aefdc55d12d110/common/pixel.c#L288
So the expected vectorization builds vectors

  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }

That's not SLP; SLP tries to build the

  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }

vector and "succeeds" - the SLP tree turns out to be highly inefficient, though. So for the stores your desire is to see an interleaving scheme with VF 4 (the number of iterations). But interleaving fails because it would require a VF of 16 and there are not enough iterations in the loop.

The classical SLP scheme degenerates (also due to the mixed plus/minus ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3} expression.

Starting SLP discovery from the grouped loads would get things going up to the same expression.

So I'm not sure what the best approach to this case is. The testcase can be simplified while still showing the SLP discovery issue:

extern void test(unsigned int t[4][4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    a0 = (p1[0] - p2[0]);
    a1 = (p1[1] - p2[1]);
    a2 = (p1[2] - p2[2]);
    a3 = (p1[3] - p2[3]);

    int t0 = a0 + a1;
    int t1 = a0 - a1;
    int t2 = a2 + a3;
    int t3 = a2 - a3;

    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
  test(tmp);
}

So it's basically SLP discovery degenerating to an interleaving scheme on the load side but not actually "implementing" it.
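Spelled out for the simplified testcase (roughly, with lanes written in store order), the row-grouped discovery looks like:

  store group { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
            = { t0 + t2,   t1 + t3,   t0 - t2,   t1 - t3 }      (mixed +/- node)
    operand A { t0, t1, t0, t1 } = { a0 + a1, a0 - a1, a0 + a1, a0 - a1 }
    operand B { t2, t3, t2, t3 } = { a2 + a3, a2 - a3, a2 + a3, a2 - a3 }
      operands of A: { a0, a0, a0, a0 } and { a1, a1, a1, a1 }  (uniform)
      operands of B: { a2, a2, a2, a2 } and { a3, a3, a3, a3 }  (uniform)

so one level below the a{0,2} {+,-} a{1,3} expressions everything becomes uniform vectors and discovery never reaches the p1/p2 loads.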
(In reply to Richard Biener from comment #2)
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in the current implementation, we get four grouped stores:
  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }  /i=0,1,2,3/ independently.

When all these attempts fail, could we retry with the groups
  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] }  /i=0,1,2,3/
with one extra intermediate layer covering those original groups, then start from these newly adjusted groups? The built operands should be isomorphic then. Maybe too hackish?
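To illustrate the last point: with the transposed groups every lane comes from the same position of a different iteration, so each operand vector holds four distinct but isomorphic computations instead of a uniform splat, all the way down to strided loads from p1/p2:

  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
    = { t0+t2 of iter 0, t0+t2 of iter 1, t0+t2 of iter 2, t0+t2 of iter 3 }
      operand A: { t0 of iter 0..3 }   operand B: { t2 of iter 0..3 }
        operands of A: { a0 of iter 0..3 } and { a1 of iter 0..3 }
          ... down to { p1[0] of iter 0..3 }, { p2[0] of iter 0..3 }, etc.
              (strided by i1/i2)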
(In reply to Kewen Lin from comment #3)
> IIUC, in the current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }  /i=0,1,2,3/ independently.
>
> When all these attempts fail, could we retry with the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] }  /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then
> start from these newly adjusted groups? The built operands should be
> isomorphic then. Maybe too hackish?

This thought seems impractical in the current framework. I was thinking whether we can use another vectorization approach which is sub-optimal but fits the current framework better. It only focuses on things within one loop iteration and starts from the grouped stores { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }. It can be:

w1_0123 = load_word(p1)
V1_0123 = scalar_to_vec(w1_0123)   // also with promotion
w1_4567 = load_word(p1+4)
V1_4567 = scalar_to_vec(w1_4567)
w2_0123 = load_word(p2)
V2_0123 = scalar_to_vec(w2_0123)
w2_4567 = load_word(p2+4)
V2_4567 = scalar_to_vec(w2_4567)

V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567) << 16

V1 = V_a0123                       // {a0, a1, a2, a3}
V2 = vperm (V1, <1, 0, 3, 2>)      // {a1, a0, a3, a2}
V3 = V1 - V2                       // {t1, -t1, t3, -t3}
V4 = V1 + V2                       // {t0, t0, t2, t2}
V5 = vperm (V3, V4, <4, 0, 6, 2>)  // {t0, t1, t2, t3}
V6 = vperm (V3, V4, <6, 2, 4, 0>)  // {t2, t3, t0, t1}
V7 = V5 - V6                       // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
V8 = V5 + V6                       // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}
V9 = vperm (V7, V8, <4, 5, 0, 1>)
vec_store (tmp[i], V9)

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize this simple case. SLP doesn't go deep enough to expose the grouped-load opportunity on the bottom nodes loading p1/p2. It ends up early with the nodes { _13, _14, _13, _14 } and { _15, _16, _15, _16 } because operands like { a0_28, a0_28, a0_28, a0_28 } etc. satisfy the all_uniform_p condition, so "Building parent vector operands from scalars instead".

One rough idea seems to be:
1) Relax the all_uniform_p condition somehow to get SLP instance building to go deeper and get those p1/p2 loads as SLP nodes.
2) Introduce one more vect_pattern recognizer to catch this kind of pattern and transform the SLP instance as we expect. I assume we can know the whole SLP instance, then we can transform it as we want here. We probably need some costing condition to gate this pattern matching.
3) If 2) fails, trim the SLP instance from those nodes which satisfy the all_uniform_p condition to ensure it's the same as before.

Does it sound reasonable?
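A rough C sketch of the above scheme for the simplified case, using GCC's generic vector extensions; the names foo_vec and vec_u32 are made up, __builtin_shuffle stands in for the vperms, and the memcpy loads/stores are only there to keep the sketch alignment-agnostic:

typedef unsigned int vec_u32 __attribute__((vector_size(16)));

extern void test(unsigned int t[4]);

void foo_vec(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  vec_u32 V1, V2, V3, V4, V5, V6, V7, V8, V9;

  __builtin_memcpy(&V1, p1, 16);                          /* {p1[0..3]} */
  __builtin_memcpy(&V2, p2, 16);                          /* {p2[0..3]} */
  V1 = V1 - V2;                                           /* {a0, a1, a2, a3} */

  V2 = __builtin_shuffle(V1, (vec_u32){1, 0, 3, 2});      /* {a1, a0, a3, a2} */
  V3 = V1 - V2;                                           /* {t1, -t1, t3, -t3} */
  V4 = V1 + V2;                                           /* {t0, t0, t2, t2} */
  V5 = __builtin_shuffle(V3, V4, (vec_u32){4, 0, 6, 2});  /* {t0, t1, t2, t3} */
  V6 = __builtin_shuffle(V3, V4, (vec_u32){6, 2, 4, 0});  /* {t2, t3, t0, t1} */
  V7 = V5 - V6;                                           /* {t0-t2, t1-t3, ...} */
  V8 = V5 + V6;                                           /* {t0+t2, t1+t3, ...} */
  V9 = __builtin_shuffle(V7, V8, (vec_u32){4, 5, 0, 1});  /* {tmp[0..3]} */
  __builtin_memcpy(tmp, &V9, 16);

  test(tmp);
}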
(In reply to Kewen Lin from comment #4)
> One rough idea seems to be:
> 1) Relax the all_uniform_p condition somehow to get SLP instance building
> to go deeper and get those p1/p2 loads as SLP nodes.
> 2) Introduce one more vect_pattern recognizer to catch this kind of
> pattern and transform the SLP instance as we expect. I assume we can know
> the whole SLP instance, then we can transform it as we want here. We
> probably need some costing condition to gate this pattern matching.
> 3) If 2) fails, trim the SLP instance from those nodes which satisfy the
> all_uniform_p condition to ensure it's the same as before.

For 2), instead of a vect_pattern with an IFN, the appropriate place seems to be vect_optimize_slp. But after more thinking, building the SLP instance starting from the grouped loads instead of the grouped stores looks more straightforward.

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

Building the vector <a0, a1, a2, a3> looks more natural; then check the uses of all its lanes and the special patterns to get the vector <t0, t1, t2, t3>, and repeat similarly.

Hi Richi, is this a good example for requesting SLP instance building to start from grouped loads?
Starting from the loads is not how SLP discovery works, so there will be zero re-use of code. Sure - the only important thing is that you end up with a valid SLP graph.

But going back to the original testcase and the proposed vectorization for power - is that faster in the end?

For the "rewrite" of the vectorizer into all-SLP we do have to address that "interleaving scheme not carried out as interleaving" at some point, but that's usually for loop vectorization - for BB vectorization all we have is optimize_slp. I have patches that would build the vector load SLP node (you still have to kill that 'build from scalars' thing to make it trigger). But then we end up with a shared vector load node and N extract/splat operations at the 'scalar' points. It's not entirely clear to me how to re-arrange the SLP graph at that point.

Btw, on current trunk the simplified testcase no longer runs into the 'scalar operand' build case, but of course vectorization is thought to be not profitable. Pattern recog of the plus/minus subgraphs may help (not sure if ppc has those as instructions, x86 has).

That said, "failure" to identify the common (vector) load is known and I do have experimental patches trying to address that but did not yet arrive at a conclusive "best" approach.
(In reply to Richard Biener from comment #6)
> Starting from the loads is not how SLP discovery works, so there will be
> zero re-use of code. Sure - the only important thing is that you end up
> with a valid SLP graph.
>
> But going back to the original testcase and the proposed vectorization
> for power - is that faster in the end?

Good question. I coded foo2 and foo3 with altivec built-in functions as the proposals; foo2 is for the original version (#c0) and foo3 is for the sub-optimal version (#c4). They have been verified with some random inputs to ensure functionality. I evaluated the timings on Power8/Power9/Power10 (pre) with -Ofast and the corresponding -mcpu=power{8,9,10} for the build. Using foo1 (the scalar version) as the baseline of 100 and normalizing, the speed-ups are below (bigger is better):

        foo1    foo2    foo3
  P8    100     163.26  164.97
  P9    100     101.15  101.60
  P10   100     172.42  159.10

The evaluation shows the vectorized versions can have big gains on Power8 and Power10, but the gain is trivial on Power9. I tried to evaluate the P8 executable on P9; it didn't show any remarkable difference. There are some known issues on P9 for fp/vect-intensive workloads; it's probably due to that, since the vectorized versions are full of vector code. Sigh, we might have to use different costs for P9, but that's another story. Btw, I tried to run the P9 executable on P10, and it also gains as much as what we see in the P10 row. So it's profitable to vectorize this, as the P8 and P10 testing shows.

> For the "rewrite" of the vectorizer into all-SLP we do have to address
> that "interleaving scheme not carried out as interleaving" at some point,
> but that's usually for loop vectorization - for BB vectorization all
> we have is optimize_slp. I have patches that would build the vector load
> SLP node (you still have to kill that 'build from scalars' thing to make
> it trigger). But then we end up with a shared vector load node and
> N extract/splat operations at the 'scalar' points. It's not entirely
> clear to me how to re-arrange the SLP graph at that point.

Great, it's good to know we can end up with a shared vector load node. It looks like we can further simplify the graph on top of it. For the #c4 simplified case, assuming we have two shared vector load nodes:

  N1 = <p1[0], p1[1], p1[2], p1[3]>
  N2 = <p2[0], p2[1], p2[2], p2[3]>

1) If all lanes of one shared vector load node only work with all lanes of another shared vector load node, we can move up (postpone) the extract/splat operations. For this case, we need 4 extract/splats for p1's lanes, 4 extract/splats for p2's lanes and 4 minuses for {T,T,T,T} /T=a{0,1,2,3}/; this can be simplified to one shared node N3 = N1 - N2 (1 minus) and 4 extract/splats.

2) Then we have the shared vector node <a0,a1,a2,a3>. Before it's extracted/splatted into {T,T,T,T} /T=a{0,1,2,3}/, we can check all the uses of the extracted/splatted lanes, try to detect some pattern and simplify further. For this case, it can be permuted to <a1,a0,a3,a2>, then the minus/plus and the permutations give the nodes N4 <_13, _14, _13, _14> and N5 <_15, _16, _15, _16>, saving the need for extract/splat.

3) Further, we can detect the pattern on the SLP graph and skip generating N4 and N5 altogether, instead generating <_13, _14, _15, _16> and repeating the minus/plus and permutations for the required t0/t1/t2/t3 computation.

> Btw, on current trunk the simplified testcase no longer runs into the
> 'scalar operand' build case, but of course vectorization is thought to be
> not profitable. Pattern recog of the plus/minus subgraphs may help
> (not sure if ppc has those as instructions, x86 has).

You meant the simplified case in #c4? I rebased onto the latest trunk (r11-6601) but still ran into the 'scalar operand' build; unfortunately ppc doesn't have that kind of instruction, sigh.

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

Thanks for the information!
Created attachment 49942 [details] vectorized with altivec built-in functions
The full satd_8x4 looks like the following; the 2nd loop isn't to be disregarded:

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
Note that current clang does a pretty decent job on this now on aarch64 (in case it gives some inspiration on the approach) https://godbolt.org/z/EPvqMhh7v
> The full satd_8x4 looks like the following; the 2nd loop isn't to be
> disregarded

Regarding the second loop, this patch
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608827.html
should result in improved vectorization and performance. Currently ((a>>15)&0x10001)*0xffff from abs2 is calculated using 4 vector operations (shift, bitand, shift+sub for the multiplication), whereas with the proposed patch this will be a single vector compare operation.
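To illustrate the operation-count argument (not the actual patch): if each 32-bit SWAR value is viewed as two 16-bit lanes, the multiply-based mask in abs2 is just a per-lane "is negative" mask, which one vector compare produces directly. A sketch with GCC generic vector extensions, where vec_u16/vec_s16 and the function names are made up; it only shows the mask computation itself:

typedef unsigned short vec_u16 __attribute__((vector_size(16)));
typedef short vec_s16 __attribute__((vector_size(16)));

/* What the current code does per lane: shift, and, multiply
   (the multiply itself being expanded as shift+sub).  */
static inline vec_u16 sign_mask_mul(vec_u16 a)
{
  return ((a >> 15) & 1) * 0xffff;
}

/* The same mask with one compare: all-ones where the lane is negative.  */
static inline vec_u16 sign_mask_cmp(vec_u16 a)
{
  vec_s16 zero = { 0 };
  return (vec_u16)((vec_s16)a < zero);
}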
Hi Richi,

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

It was a long time ago, so do you have the "best" approach now?

Thanks,
-Jiangning
(In reply to Jiangning Liu from comment #12)
> Hi Richi,
>
> > That said, "failure" to identify the common (vector) load is known
> > and I do have experimental patches trying to address that but did
> > not yet arrive at a conclusive "best" approach.
>
> It was a long time ago, so do you have the "best" approach now?

I do have ideas, but how it will play out is unknown. Also, the current priority is to improve maintainability of the code by getting rid of the paths in the vectorizer that use the non-SLP data structures. Next would be to redo loop SLP discovery from scratch, which would include possibly fixing this issue.

> Thanks,
> -Jiangning
Btw, previous work is at refs/users/rguenth/heads/load-perm