Consider this code:

void foo(int * restrict a, int b, int *c)
{
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.

    It currently uses graphds to compute a postorder on the reverse
    SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
    did (hopefully - I've adjusted most testcases that triggered it
    a few days ago).  It restricts itself to move around bijective
    permutations to simplify things for now, mainly around constant nodes.

    As a prerequesite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus should
    get cycles more obviously correct.

    [ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;

This is good.  Note how the VEC_PERM_EXPR happens before the vector multiply and how the vector multiply directly feeds the vector add.  On our target we have a vector multiply-add which would be generated and all is good.

After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;

Note how we have the vmul, then permute, then vadd.  This spoils our ability to generate a vmadd.  This behavior is still seen on the trunk as well.

Conceptually it seems to me that having a permute at the start or end of a chain of vector operations is better than moving the permute into the middle of a chain of dependent vector operations.  We could probably fix this in the backend with some special patterns, but ISTM that getting it right in SLP would be better.
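For what it's worth, the two shapes can be written out by hand with GCC's generic vector extensions.  This is only a hand-written illustration of where the permute sits in the dependency chain, not actual vectorizer output, and all names in it are invented for the example:

typedef int v4si __attribute__ ((vector_size (16)));

/* Permute feeds the multiply: the multiply's result feeds the add
   directly, so a multiply-add can be formed.  */
v4si perm_first (v4si a, v4si c, int b)
{
  v4si vb = { b, b, b, b };
  v4si mask = { 0, 2, 1, 3 };
  v4si p = __builtin_shuffle (c, mask);
  return p * vb + a;
}

/* Permute sits between the multiply and the add: the add is fed by the
   permute, which blocks fusing the multiply and the add.  */
v4si perm_middle (v4si a, v4si c, int b)
{
  v4si vb = { b, b, b, b };
  v4si mask = { 0, 2, 1, 3 };
  v4si m = c * vb;
  return __builtin_shuffle (m, mask) + a;
}

Both functions compute the same values (b is a splat, so the permute commutes with the multiply); only the placement of the permute in the dependency chain differs.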
I wonder if we should do something like the following in match.pd anyway (unrelated to this issue):

(simplify
 (vec_perm (any_binary:s (VEC_DUP/CONSTRUCTOR@0) @1)@2 @2 @3)
 (any_binary @0 (vec_perm @1 @1 @3)))
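The rule above relies on the binary operation having one uniform (duplicated) operand, so the permute can be pushed through it.  A minimal sketch of that identity, checked with GCC's generic vector extensions (all names invented for the example):

#include <assert.h>

typedef int v4si __attribute__ ((vector_size (16)));

int main (void)
{
  v4si v = { 1, 2, 3, 4 };
  int b = 5;
  v4si dup = { b, b, b, b };    /* the VEC_DUP/CONSTRUCTOR operand @0 */
  v4si mask = { 0, 2, 1, 3 };   /* the permute selector @3            */

  /* (vec_perm (mult @0 @1) (mult @0 @1) @3) ...  */
  v4si lhs = __builtin_shuffle (dup * v, mask);
  /* ... equals (mult @0 (vec_perm @1 @1 @3)) because @0 is uniform.  */
  v4si rhs = dup * __builtin_shuffle (v, mask);

  for (int i = 0; i < 4; i++)
    assert (lhs[i] == rhs[i]);
  return 0;
}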
Confirmed.

void foo(float * restrict a, float b, float *c)
{
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

shows the issue on x86_64 as a missing FMA.

One complication is that FMA formation only happens in a later pass, so the vectorizer has no guidance for deciding where to place the permute.  Note that the vectorizer's permute optimization propagates permutes in one direction only (apart from the optimistic pieces); the intent is to reduce the number of permutes, which almost exclusively come from loads.
Understood WRT phase ordering. That was fully expected. What I still don't understand is why moving the permute down is profitable here or generally why moving a permute into a dependency chain is profitable. ISTM that hoisting a permute up a dependency chain so that it's fed by a load is profitable. Similarly sinking a permute to the end of a dependency chain so that it feeds a store is profitable. Moving a permute so that it feeds or is fed by another permute is probably profitable too. Otherwise moving a permute into the middle of a dependency chain seems like a pessimization.
(In reply to Jeffrey A. Law from comment #3)
> Understood WRT phase ordering.  That was fully expected.
>
> What I still don't understand is why moving the permute down is profitable
> here or generally why moving a permute into a dependency chain is
> profitable.
>
> ISTM that hoisting a permute up a dependency chain so that it's fed by a
> load is profitable.  Similarly sinking a permute to the end of a dependency
> chain so that it feeds a store is profitable.  Moving a permute so that it
> feeds or is fed by another permute is probably profitable too.  Otherwise
> moving a permute into the middle of a dependency chain seems like a
> pessimization.

Yes, I agree.  As said, vect_optimize_slp is missing a part that pushes the permutes back towards the loads.  At the moment it is designed to move everything towards the stores (or reductions) with the intent of getting rid of them.  If that doesn't succeed, the final placement of the permutes isn't optimized - only the number of permutes is.
Patch proposed. https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591644.html
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:49fb0af9bf8f16907980d383c2bbc85e185ec2e0

commit r12-7653-g49fb0af9bf8f16907980d383c2bbc85e185ec2e0
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Mar 15 09:05:28 2022 +0000

    PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA.

    This patch resolves PR tree-optimization/101895, a missed-optimization
    regression, by adding a constant folding simplification to match.pd to
    rewrite "mult; vec_perm; plus" into "vec_perm; mult; plus", with the aim
    that keeping the multiplication and addition next to each other allows
    them to be recognized as a fused multiply-add on suitable targets.
    This transformation requires a tweak to match.pd's vec_same_elem_p
    predicate to handle CONSTRUCTOR_EXPRs using the same SSA_NAME_DEF_STMT
    idiom used for constructors elsewhere in match.pd.

    The net effect is that the following code example:

    void foo(float * __restrict__ a, float b, float *c)
    {
      a[0] = c[0]*b + a[0];
      a[1] = c[2]*b + a[1];
      a[2] = c[1]*b + a[2];
      a[3] = c[3]*b + a[3];
    }

    when compiled on x86_64-pc-linux-gnu with -O2 -march=cascadelake
    currently generates:

            vbroadcastss    %xmm0, %xmm0
            vmulps  (%rsi), %xmm0, %xmm0
            vpermilps       $216, %xmm0, %xmm0
            vaddps  (%rdi), %xmm0, %xmm0
            vmovups %xmm0, (%rdi)
            ret

    but with this patch now generates the improved:

            vpermilps       $216, (%rsi), %xmm1
            vbroadcastss    %xmm0, %xmm0
            vfmadd213ps     (%rdi), %xmm0, %xmm1
            vmovups %xmm1, (%rdi)
            ret

    2022-03-15  Roger Sayle  <roger@nextmovesoftware.com>
                Marc Glisse  <marc.glisse@inria.fr>
                Richard Biener  <rguenther@suse.de>

    gcc/ChangeLog
            PR tree-optimization/101895
            * match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
            (plus (vec_perm (mult ...) ...) ...): New reordering simplification.

    gcc/testsuite/ChangeLog
            PR tree-optimization/101895
            * gcc.target/i386/pr101895.c: New test case.
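The commit message does not quote the new test, so here is only a minimal sketch of what a test in the spirit of gcc.target/i386/pr101895.c might look like; the actual file's options and scan pattern may differ:

/* { dg-do compile } */
/* { dg-options "-O2 -march=cascadelake" } */

void foo(float * __restrict__ a, float b, float *c)
{
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

/* Expect a fused multiply-add to be emitted.  */
/* { dg-final { scan-assembler "vfmadd" } } */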
This should now be fixed on mainline.
Fixed on trunk; keeping this open for reference (a backport is not intended).  There is also still the possibility of improving how the vectorizer handles this instead of fixing it up after the fact, but that is for the future.
Yea, no need to backport this.
And just an FYI: as expected, this resolves the regression on our internal target.  Thanks, Roger!
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
Fixed in GCC 12.