The following testcase:

unsigned char ur[16], ua[16], ub[16];

void
avgu_v2qi (void)
{
  int i;

  for (i = 0; i < 2; i++)
    ur[i] = (ua[i] + ub[i] + 1) >> 1;
}

does not vectorize on x86_64-linux-gnu with -O2 -ftree-vectorize.
t.c:8:11: note: Costing subgraph:
t.c:8:11: note: node 0x409a000 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: ur[0] = _23;
t.c:8:11: note: stmt 0 ur[0] = _23;
t.c:8:11: note: stmt 1 ur[1] = _35;
t.c:8:11: note: children 0x409a088
t.c:8:11: note: node 0x409a088 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: patt_58 = (unsigned char) patt_56;
t.c:8:11: note: stmt 0 patt_58 = (unsigned char) patt_56;
t.c:8:11: note: stmt 1 patt_71 = (unsigned char) patt_69;
t.c:8:11: note: children 0x409a110
t.c:8:11: note: node 0x409a110 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: patt_56 = .AVG_CEIL (_16, _18);
t.c:8:11: note: stmt 0 patt_56 = .AVG_CEIL (_16, _18);
t.c:8:11: note: stmt 1 patt_69 = .AVG_CEIL (_28, _30);
t.c:8:11: note: children 0x409a220 0x409a198
t.c:8:11: note: node 0x409a220 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: _16 = ua[0];
t.c:8:11: note: stmt 0 _16 = ua[0];
t.c:8:11: note: stmt 1 _28 = ua[1];
t.c:8:11: note: node 0x409a198 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: _18 = ub[0];
t.c:8:11: note: stmt 0 _18 = ub[0];
t.c:8:11: note: stmt 1 _30 = ub[1];
t.c:8:11: note: Cost model analysis:
_23 1 times scalar_store costs 12 in body
_35 1 times scalar_store costs 12 in body
(unsigned char) _22 1 times scalar_stmt costs 4 in body
(unsigned char) _34 1 times scalar_stmt costs 4 in body
ua[0] 1 times vector_load costs 12 in body
ub[0] 1 times vector_load costs 12 in body
.AVG_CEIL (_16, _18) 1 times vector_stmt costs 4 in body
_23 1 times vector_store costs 12 in body
ua[0] 1 times vec_to_scalar costs 4 in epilogue
ua[1] 1 times vec_to_scalar costs 4 in epilogue
ub[0] 1 times vec_to_scalar costs 4 in epilogue
ub[1] 1 times vec_to_scalar costs 4 in epilogue
t.c:8:11: note: Cost model analysis for part in loop 0:
Vector cost: 56
Scalar cost: 32
t.c:8:11: missed: not vectorized: vectorization is not profitable.

It looks like somehow the scalar costing is off and the scalar loads from ua and ub are considered live.
Possibly an artifact of patterns. It's vectorized fine with -fno-vect-cost-model. I will have a look, though possibly not for GCC 12.
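To make the imbalance explicit, here is a small sketch that only adds up the entries from the cost dump above. The struct and function names are made up for illustration and are not GCC internals; only the numbers come from the dump. What scalar loads or the scalar add/shift would cost is not shown in the dump, so the sketch does not guess at them.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the "Cost model analysis" dump (hypothetical type,
   not a GCC data structure).  */
struct cost_entry
{
  const char *kind;
  int cost;
};

/* Scalar side, exactly as listed: two stores and two conversions.
   Note there are no load entries and no entries for the averaging.  */
static const struct cost_entry scalar_entries[] = {
  { "scalar_store", 12 },   /* _23 */
  { "scalar_store", 12 },   /* _35 */
  { "scalar_stmt", 4 },     /* (unsigned char) _22 */
  { "scalar_stmt", 4 },     /* (unsigned char) _34 */
};

/* Vector side, body part.  */
static const struct cost_entry vector_body_entries[] = {
  { "vector_load", 12 },    /* ua[0] */
  { "vector_load", 12 },    /* ub[0] */
  { "vector_stmt", 4 },     /* .AVG_CEIL (_16, _18) */
  { "vector_store", 12 },   /* _23 */
};

/* Vector side, epilogue part: extracts for the four loads that the
   costing considers live.  */
static const struct cost_entry vector_epilogue_entries[] = {
  { "vec_to_scalar", 4 },   /* ua[0] */
  { "vec_to_scalar", 4 },   /* ua[1] */
  { "vec_to_scalar", 4 },   /* ub[0] */
  { "vec_to_scalar", 4 },   /* ub[1] */
};

#define COUNT(a) (sizeof (a) / sizeof ((a)[0]))

/* Sum the cost field of N entries.  */
static int
sum_costs (const struct cost_entry *e, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += e[i].cost;
  return total;
}

int
scalar_total (void)
{
  return sum_costs (scalar_entries, COUNT (scalar_entries));
}

int
vector_body_total (void)
{
  return sum_costs (vector_body_entries, COUNT (vector_body_entries));
}

int
vector_epilogue_total (void)
{
  return sum_costs (vector_epilogue_entries,
                    COUNT (vector_epilogue_entries));
}

/* Check the sums against the totals printed in the dump.  */
void
check_against_dump (void)
{
  assert (scalar_total () == 32);
  assert (vector_body_total () + vector_epilogue_total () == 56);
}
```

So 16 of the 56 on the vector side are the vec_to_scalar extracts, and the 32 on the scalar side covers only the stores and conversions. Even without the extracts the vector body (40) would still exceed the scalar total as costed.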
I think I've seen this before: the use in the conversion is elided in the vector path via recognizing a pattern of a pattern. That makes it not part of the SLP tree, so it is left as SLP_TYPE (..) = loop_vect, which fools the live computation. vect_detect_hybrid_slp now does this in a more correct way, but the original worklist seeding has to be done differently for BB SLP.
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:cb46559cea1d554cef1138db5bfbdd0647ffbc0d

commit r12-6535-gcb46559cea1d554cef1138db5bfbdd0647ffbc0d
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Wed Jan 12 20:57:12 2022 +0100

    testsuite: Compile gcc.target/i386/pr103861-3.c with -fno-vect-cost-model [PR103941]

    2022-01-12  Uroš Bizjak  <ubizjak@gmail.com>

    gcc/testsuite/ChangeLog:

            PR target/103941
            * gcc.target/i386/pr103861-3.c (dg-options):
            Add -fno-vect-cost-model.
Another testcase where this occurs:

void foo (int *c, float *x, float *y)
{
  c[0] = x[0] < y[0];
  c[1] = x[1] < y[1];
  c[2] = x[2] < y[2];
  c[3] = x[3] < y[3];
}
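For anyone wanting to confirm that the results do not depend on whether the block gets vectorized, a minimal driver sketch follows. The helper name run_foo_example and the input values are my own choices, not part of the PR. The idea is to build it at -O2 -ftree-vectorize with and without -fno-vect-cost-model (the options mentioned earlier in this PR) and compare.

```c
#include <assert.h>

/* The testcase from this comment, unchanged.  */
void foo (int *c, float *x, float *y)
{
  c[0] = x[0] < y[0];
  c[1] = x[1] < y[1];
  c[2] = x[2] < y[2];
  c[3] = x[3] < y[3];
}

/* Hypothetical helper: run foo on fixed inputs and pack the four
   comparison results into the low four bits of the return value,
   with c[0] in bit 0.  */
int
run_foo_example (void)
{
  float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float y[4] = { 2.0f, 2.0f, 1.0f, 5.0f };
  int c[4] = { 0, 0, 0, 0 };

  foo (c, x, y);

  /* Each comparison result must be exactly 0 or 1.  */
  for (int i = 0; i < 4; i++)
    assert (c[i] == 0 || c[i] == 1);

  return c[0] | (c[1] << 1) | (c[2] << 2) | (c[3] << 3);
}
```

With these inputs only lanes 0 and 3 compare true, so the packed value is 9 whether or not the four stores end up as one vector store.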
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:353434b65ef7972172597d232ae17022d9a57244

commit r12-8195-g353434b65ef7972172597d232ae17022d9a57244
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Apr 13 13:49:45 2022 +0200

    tree-optimization/104010 - fix SLP scalar costing with patterns

    When doing BB vectorization the scalar cost compute is derailed
    by patterns, causing lanes to be considered live and thus not
    costed on the scalar side.  For the testcase in PR104010 this
    prevents vectorization which was done by GCC 11.  PR103941
    shows similar cases of missed optimizations that are fixed by
    this patch.

    2022-04-13  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/104010
            PR tree-optimization/103941
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): When
            we run into stmts in patterns continue walking those
            for uses outside of the vectorized region instead of
            marking the lane live.

            * gcc.target/i386/pr103941-1.c: New testcase.
            * gcc.target/i386/pr103941-2.c: Likewise.
Fixed on trunk via the PR104010 regression fix.