Hello, when we have two VEC_PERM_EXPR with constant mask, where one is the only user of the result of the other one, it would be good to compose/merge them into a single VEC_PERM_EXPR. However, it is too hard for backends to always generate optimal code for shuffles, so we want to do the optimization only if we know it actually helps. Currently this means when the composed permutation is the identity. In the future, it could mean asking the backend. See the conversation that started at: http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00676.html and around this message for cost hooks (which could also help the vectorizer): http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00973.html Related bug is http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147 but that one is about RTL (unless x86 eventually follows ARM and decides to implement _mm_* functions in terms of __builtin_shuffle).
Confirmed.
*** Bug 102056 has been marked as a duplicate of this bug. ***
Here is example code which should produce the same code: typedef int v4si __attribute__((vector_size (16))); v4si foo (v4si a, v4si b) { v4si c = __builtin_shuffle (a, b, __extension__ (v4si) {1, 4, 2, 7}); v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 }); return d; } typedef int v4si __attribute__((vector_size (16))); v4si foo1 (v4si a, v4si b) { v4si c = __builtin_shuffle (a, b, __extension__ (v4si){ 7, 2, 1, 4 }); return c; }
Now more x86 shuffle instrinsics are folded into gimple VEC_PERM_EXPR, I guess we need some Gimple-level pattern match to simplify successive vec_perm_expr.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:b88adba751da635c6f0c353c5bc51bbe2ecf4c89 commit r13-3212-gb88adba751da635c6f0c353c5bc51bbe2ecf4c89 Author: Liwei Xu <liwei.xu@intel.com> Date: Fri Sep 23 13:46:02 2022 +0800 Optimize nested permutation to single VEC_PERM_EXPR [PR54346] This patch implemented the optimization in PR 54346, which Merges c = VEC_PERM_EXPR <a, b, VCST0>; d = VEC_PERM_EXPR <c, c, VCST1>; to d = VEC_PERM_EXPR <a, b, NEW_VCST>; Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} tree-ssa/forwprop-19.c fail to pass but I'm not sure whether it is ok to removed it. gcc/ChangeLog: PR tree-optimization/54346 * match.pd: Merge the index of VCST then generates the new vec_perm. gcc/testsuite/ChangeLog: * gcc.dg/pr54346.c: New test. Co-authored-by: liuhongt <hongtao.liu@intel.com>
The log says that this breaks tree-ssa/forwprop-19.c, but I don't see any xfail or anything. Does it only fail because gimple-simplify leaves some dead code around, so you could update the test to scan the next DCE pass dump instead of forwprop1? Or are we missing a transformation that just detects a VEC_PERM_EXPR with an identity permutation?
(In reply to Marc Glisse from comment #6) > The log says that this breaks tree-ssa/forwprop-19.c, but I don't see any > xfail or anything. Does it only fail because gimple-simplify leaves some > dead code around, so you could update the test to scan the next DCE pass > dump instead of forwprop1? Or are we missing a transformation that just > detects a VEC_PERM_EXPR with an identity permutation? Uoops, I didn't notice this, will add an incremental patch to handle the indentical index (forwporp-19.c) scenario, sorry.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:fa553ff26d96f6fecaa8f1b00649cfdc6cda5f5a commit r13-3430-gfa553ff26d96f6fecaa8f1b00649cfdc6cda5f5a Author: Jakub Jelinek <jakub@redhat.com> Date: Fri Oct 21 09:16:44 2022 +0200 match.pd: Fix up gcc.dg/pr54346.c on i686-linux [PR54346] The pr54346.c testcase FAILs on i686-linux (without -msse*) for multiple reasons. One is the trivial missing -Wno-psabi which the following patch adds, but that isn't enough. The thing is that without native vector support, we have VEC_PERM_EXPRs in the IL and are actually considering the nested VEC_PERM_EXPRs into one VEC_PERM_EXPR optimization, but punt because can_vec_perm_const_p (result_mode, op_mode, sel2, false) is false. Such a test makes sense to prevent "optimizing" two VEC_PERM_EXPRs that can be handled by the backend natively into one VEC_PERM_EXPR that can't be handled. But if both of the original VEC_PERM_EXPRs can't be handled natively either, having just one VEC_PERM_EXPR that will be lowered by generic vec lowering is IMHO still better than 2. Or even if we trade just one VEC_PERM_EXPR that can't be handled plus one that can to one that can't be handled. Also, removing the testcase's executable permissions... 2022-10-21 <jakub@redhat.com> PR tree-optimization/54346 * match.pd ((vec_perm (vec_perm@0 @1 @2 VECTOR_CST) @0 VECTOR_CST)): Optimize nested VEC_PERM_EXPRs even if target can't handle the new one provided we don't increase number of VEC_PERM_EXPRs the target can't handle. * gcc.dg/pr54346.c: Add -Wno-psabi to dg-options.
(In reply to Hongtao.liu from comment #7) > (In reply to Marc Glisse from comment #6) > > The log says that this breaks tree-ssa/forwprop-19.c, but I don't see any > > xfail or anything. Does it only fail because gimple-simplify leaves some > > dead code around, so you could update the test to scan the next DCE pass > > dump instead of forwprop1? Or are we missing a transformation that just > > detects a VEC_PERM_EXPR with an identity permutation? > > Uoops, I didn't notice this, will add an incremental patch to handle the > indentical index (forwporp-19.c) scenario, sorry. It's fixed.