In a loop reduction path containing a lane-reducing operation (DOT_PROD/SAD/WIDEN_SUM), the current vectorizer cannot handle the pattern if the chain contains any other operation, whether a normal one or another lane-reducing one. A pseudo example:

  char *d0, *d1;
  char *s0, *s1;
  char *w;
  int  *n;
  ...
  int sum = 0;
  for (i)
    {
      ...
      sum += d0[i] * d1[i];       /* DOT_PROD  */
      ...
      sum += abs(s0[i] - s1[i]);  /* SAD       */
      ...
      sum += w[i];                /* WIDEN_SUM */
      ...
      sum += n[i];                /* Normal    */
      ...
    }
  ... = sum;

In such a case the reduction vectype varies with the operations, which causes a mismatch between the counts of vectorized defs and uses. A possible way to fix that is to generate extra trivial pass-through copies. Given a concrete example:

  sum = 0;
  for (i)
    {
      sum += d0[i] * d1[i];   /* 16*char -> 4*int */
      sum += n[i];            /* 4*int   -> 4*int */
    }

the final vectorized statements could be:

  sum_v0 = { 0, 0, 0, 0 };
  sum_v1 = { 0, 0, 0, 0 };
  sum_v2 = { 0, 0, 0, 0 };
  sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
      sum_v1 += 0; // copy
      sum_v2 += 0; // copy
      sum_v3 += 0; // copy

      sum_v0 += v_n[i:  0 ..  3];
      sum_v1 += v_n[i:  4 ..  7];
      sum_v2 += v_n[i:  8 .. 11];
      sum_v3 += v_n[i: 12 .. 15];
    }

  sum = REDUC_PLUS (sum_v0 + sum_v1 + sum_v2 + sum_v3);

In the above sequence, each summation statement forms exactly one pattern. However, we could easily compose a somewhat more complicated variant that runs into a similar situation: a chain of lane-reducing operations appearing in the non-reduction addend of a single summation statement, like:

  sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

Handling this probably requires some extension in the vector pattern formation stage to split such a statement into separate patterns.
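To illustrate, a minimal sketch of what such a split could look like at the scalar-statement level, assuming the pattern recognizer were taught to decompose the compound addend (the decomposition below is hypothetical, not something the current code does):

  /* Original statement with a compound addend.  */
  sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

  /* Hypothetically split into one reduction statement per candidate,
     so each statement matches exactly one lane-reducing (or normal)
     pattern, as in the simple sequence above.  */
  sum += d0[i] * d1[i];        /* DOT_PROD  */
  sum += abs(s0[i] - s1[i]);   /* SAD       */
  sum += n[i];                 /* Normal    */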
It looks like this should be possible, but of course it performs more "weird" re-association, which might be a problem with floating-point (though I don't know of any lane-reducing ops implemented for FP). Representing the "scalar" side in the vectorizer IL is tricky, though; that is why the current handling is kept separate and not integrated with the rest (so it doesn't compose, as you noticed). Note that we also lack SLP recognition for dot_prod and sad, because we do not specify which lanes are combined; the optabs are a black box and only useful when the resulting lanes are reduced.
The master branch has been updated by Feng Xue <fxue@gcc.gnu.org>:

https://gcc.gnu.org/g:178cc419512f7e358f88dfe2336625aa99cd7438

commit r15-2096-g178cc419512f7e358f88dfe2336625aa99cd7438
Author: Feng Xue <fxue@os.amperecomputing.com>
Date:   Wed May 29 17:22:36 2024 +0800

    vect: Support multiple lane-reducing operations for loop reduction [PR114440]

    For lane-reducing operations (dot-prod/widen-sum/sad) in a loop reduction,
    the current vectorizer can only handle the pattern if the reduction chain
    contains no other operation, whether normal or lane-reducing. This patch
    removes some constraints in reduction analysis to allow multiple arbitrary
    lane-reducing operations with mixed input vectypes in a loop reduction
    chain. For example:

      int sum = 1;
      for (i)
        {
          sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
          sum += w[i];                // widen-sum <vector(16) char>
          sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
        }

    The vector size is 128 bits, and the vectorization factor is 16. The
    reduction statements would be transformed as:

      vector<4> int sum_v0 = { 0, 0, 0, 1 };
      vector<4> int sum_v1 = { 0, 0, 0, 0 };
      vector<4> int sum_v2 = { 0, 0, 0, 0 };
      vector<4> int sum_v3 = { 0, 0, 0, 0 };

      for (i / 16)
        {
          sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
          sum_v1 = sum_v1;  // copy
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
          sum_v1 = sum_v1;  // copy
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
          sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy
        }

      sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;  // = sum_v0 + sum_v1

    2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (vectorizable_lane_reducing): New function
            declaration.
            * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
            vectorizable_lane_reducing to analyze lane-reducing operation.
            * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost
            computation code related to emulated_mixed_dot_prod.
            (vectorizable_lane_reducing): New function.
            (vectorizable_reduction): Allow multiple lane-reducing operations
            in loop reduction.  Move some original lane-reducing related code
            to vectorizable_lane_reducing.
            (vect_transform_reduction): Adjust comments with updated example.

    gcc/testsuite/
            PR tree-optimization/114440
            * gcc.dg/vect/vect-reduc-chain-1.c
            * gcc.dg/vect/vect-reduc-chain-2.c
            * gcc.dg/vect/vect-reduc-chain-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
            * gcc.dg/vect/vect-reduc-dot-slp-1.c
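For reference, a self-contained C function in the spirit of the loops the patch now vectorizes; the function name and parameter types are illustrative, not taken from the new testcases:

  #include <stdlib.h>

  /* One reduction chain mixing dot-prod, widen-sum and sad candidates;
     with this patch the whole chain can be vectorized together.  */
  int
  mixed_chain (signed char *d0, signed char *d1,
               unsigned char *w, short *s0, short *s1, int n)
  {
    int sum = 1;
    for (int i = 0; i < n; i++)
      {
        sum += d0[i] * d1[i];        /* dot-prod candidate  */
        sum += w[i];                 /* widen-sum candidate */
        sum += abs (s0[i] - s1[i]);  /* sad candidate       */
      }
    return sum;
  }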
The master branch has been updated by Feng Xue <fxue@gcc.gnu.org>:

https://gcc.gnu.org/g:db3c8c9726d0bafbb9f85b6d7027fe83602643e7

commit r15-2097-gdb3c8c9726d0bafbb9f85b6d7027fe83602643e7
Author: Feng Xue <fxue@os.amperecomputing.com>
Date:   Wed May 29 17:28:14 2024 +0800

    vect: Optimize order of lane-reducing operations in loop def-use cycles

    When transforming multiple lane-reducing operations in a loop reduction
    chain, the corresponding vectorized statements were originally generated
    into def-use cycles starting from cycle 0. A def-use cycle with a smaller
    index would therefore contain more statements, which means more
    instruction dependency. For example:

      int sum = 1;
      for (i)
        {
          sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
          sum += w[i];                // widen-sum <vector(16) char>
          sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
          sum += n[i];                // normal <vector(4) int>
        }

    Original transformation result:

      for (i / 16)
        {
          sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
          sum_v1 = sum_v1;  // copy
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
          sum_v1 = sum_v1;  // copy
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
          sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy
          ...
        }

    To achieve higher instruction parallelism in the final vectorized loop,
    the optimal approach is to distribute the effective vector lane-reducing
    ops evenly among all def-use cycles. Transformed as below, the DOT_PROD,
    WIDEN_SUM and SADs are generated into separate cycles, so the instruction
    dependency among them can be eliminated.

      for (i / 16)
        {
          sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
          sum_v1 = sum_v1;  // copy
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = sum_v0;  // copy
          sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
          sum_v2 = sum_v2;  // copy
          sum_v3 = sum_v3;  // copy

          sum_v0 = sum_v0;  // copy
          sum_v1 = sum_v1;  // copy
          sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
          sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
          ...
        }

    2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
            reduc_result_pos.
            * tree-vect-loop.cc (vect_transform_reduction): Generate
            lane-reducing statements in an optimized order.
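A conceptual sketch of the even-distribution idea, not the actual GCC implementation: the cycle counts and the rotating start position (analogous in spirit to the new reduc_result_pos field) are illustrative values chosen to match the example above.

  #include <stdio.h>

  int
  main (void)
  {
    /* Number of def-use cycles of the reduction PHI (ncopies).  */
    const int total_cycles = 4;
    /* Effective vector statements needed per lane-reducing op:
       dot-prod, widen-sum, sad.  */
    const int used_cycles[] = { 1, 1, 2 };
    int next_pos = 0;  /* rotating start position across the ops */

    for (int op = 0; op < 3; op++)
      {
        printf ("op %d placed in cycle(s):", op);
        for (int k = 0; k < used_cycles[op]; k++)
          printf (" %d", (next_pos + k) % total_cycles);
        printf ("\n");
        next_pos = (next_pos + used_cycles[op]) % total_cycles;
      }
    return 0;
  }

With these counts the ops land in cycles 0, 1 and 2-3 respectively, matching the optimized transformation shown above, instead of all being packed into cycle 0.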