int x[4];
void foo(int i)
{
  i = (i+1) & 31;
  x[0] = (x[0] << i) + i;
  x[1] = (x[1] << i) + i;
  x[2] = (x[2] << i) + i;
  x[3] = (x[3] << i) + i;
}

For this testcase we're not correctly assessing that we leave the scalar
computation for (i+1) & 31 around for the generated vectorized shift by a
scalar argument.  Since we're in need of the vector of { i, ... } for the
vectorized add we ideally should simply use lane zero for the shift operand
rather than the original SSA name as we do.  Currently:

  _22 = {i_14(D), i_14(D), i_14(D), i_14(D)};
  vect_cst__23 = _22;
  vect_cst__24 = { 1, 1, 1, 1 };
  vect__1.6_25 = vect_cst__23 + vect_cst__24;
  vect_cst__26 = { 31, 31, 31, 31 };
  vect_i_15.7_27 = vect__1.6_25 & vect_cst__26;
  _1 = i_14(D) + 1;
  i_15 = _1 & 31;
  vect__2.5_21 = MEM <vector(4) int> [(int *)&x];
  vect__3.8_28 = vect__2.5_21 << i_15;
  vect__4.9_29 = vect__3.8_28 + vect_i_15.7_27;
  MEM <vector(4) int> [(int *)&x] = vect__4.9_29;
  return;

where arguably we should have done the (i+1) & 31 compute with scalars and
broadcast the result.  Assembly:

foo:
.LFB0:
        .cfi_startproc
        leal    1(%rdi), %eax
        movdqa  x(%rip), %xmm1
        movd    %edi, %xmm3
        andl    $31, %eax
        pshufd  $0, %xmm3, %xmm0
        paddd   .LC0(%rip), %xmm0
        movq    %rax, %xmm2
        pand    .LC1(%rip), %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret

which is really bad.  Even with that optimization simulated by using 'i' as
provided by the function argument we generate

foo:
.LFB0:
        .cfi_startproc
        movdqa  x(%rip), %xmm1
        movslq  %edi, %rax
        movd    %edi, %xmm3
        movq    %rax, %xmm2
        pshufd  $0, %xmm3, %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret

so RTL optimizations do not recover here either.  combine sees

(insn 8 3 9 2 (set (reg:DI 89 [ i ])
        (sign_extend:DI (reg/v:SI 86 [ i ]))) "t.c":5:16 147 {*extendsidi2_rex64}
     (nil))
(insn 10 9 11 2 (set (reg:V4SI 90 [ vect__2.6 ])
        (ashift:V4SI (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
            (reg:DI 89 [ i ]))) "t.c":5:16 3739 {ashlv4si3}
     (expr_list:REG_DEAD (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
        (expr_list:REG_DEAD (reg:DI 89 [ i ])
            (nil))))
(insn 11 10 12 2 (set (reg:V4SI 92)
        (vec_duplicate:V4SI (reg/v:SI 86 [ i ]))) "t.c":5:22 5169 {*vec_dupv4si}
     (expr_list:REG_DEAD (reg/v:SI 86 [ i ])
        (nil)))
(insn 12 11 13 2 (set (reg:V4SI 93 [ vect__3.7 ])
        (plus:V4SI (reg:V4SI 90 [ vect__2.6 ])
            (reg:V4SI 92))) "t.c":5:22 3545 {*addv4si3}

so the opportunity to "back-propagate" the vec_duplicate for the shift isn't
apparent - nor would it ever consider such a thing.  So we probably should try
to fix this in the vectorizer.  IIRC this support for non-external/constant
scalar shift args is reasonably new (g:49eab32e6e7).  Also if there's a
vector-vector shift we should probably prefer that if we already have a vector.
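For reference, a minimal sketch (SSE2 intrinsics; foo_ideal, amt and v are
made-up names, not anything the compiler emits) of the kind of sequence we
would like to end up with: the shift amount is computed once as a scalar,
used directly for the vector-by-scalar shift, and broadcast only for the add.

/* Sketch only, not generated code.  */
#include <emmintrin.h>

int x[4];

void foo_ideal (int i)
{
  int amt = (i + 1) & 31;                           /* scalar compute, once */
  __m128i v = _mm_loadu_si128 ((__m128i *) x);      /* load x[0..3] */
  v = _mm_sll_epi32 (v, _mm_cvtsi32_si128 (amt));   /* vector << scalar amt */
  v = _mm_add_epi32 (v, _mm_set1_epi32 (amt));      /* + broadcast of amt */
  _mm_storeu_si128 ((__m128i *) x, v);
}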
Mine.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:c78907d514d65483c7ddfb4cb1f5c57f23da73d9

commit r11-1633-gc78907d514d65483c7ddfb4cb1f5c57f23da73d9
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jun 24 15:49:00 2020 +0200

    tree-optimization/95866 - avoid vectorizing uniform SLP subgraphs

    This avoids vectorizing SLP subgraphs that just compute uniform
    operations on all-same operands.  That fixes the less interesting
    (but most embarrassing) part of the testcase in the PR.  On the way
    it also fixed a missing matches[0] reset in the last refactoring
    touching that place.

    2020-06-24  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/95866
            * tree-vect-slp.c (vect_slp_tree_uniform_p): New.
            (vect_build_slp_tree_2): Properly reset matches[0],
            ignore uniform constants.

            * gcc.dg/vect/bb-slp-pr95866.c is a separate testcase; this
            commit adds:
            * gcc.target/i386/pr95866-1.c: New testcase.
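As a rough illustration of what "uniform" means here (this is not the
tree-vect-slp.c code; struct lane and group_is_uniform are hypothetical): an
SLP group is uniform when every lane repeats the same operation on identical
operands, like the four (i+1) & 31 computations above, so vectorizing it
would only replicate one scalar result across lanes.

/* Hypothetical sketch, not the GCC implementation.  */
#include <stdbool.h>
#include <string.h>

struct lane { int opcode; int op0; int op1; };      /* made-up lane record */

static bool
group_is_uniform (const struct lane *lanes, int n)
{
  for (int i = 1; i < n; i++)
    if (memcmp (&lanes[i], &lanes[0], sizeof lanes[0]) != 0)
      return false;                                 /* some lane differs */
  return true;                                      /* all lanes identical */
}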
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:86ce59b4f05d8f68ec4d9a14a7732acb370412db

commit r11-1647-g86ce59b4f05d8f68ec4d9a14a7732acb370412db
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jun 25 11:21:20 2020 +0200

    tree-optimization/95866 - avoid using scalar ops for vectorized shift

    This avoids using the original scalar SSA operand when vectorizing
    a shift with a vectorized shift operand where we know all vector
    components have the same value and thus we can use a vector by
    scalar shift.  Using the scalar SSA operand causes a possibly long
    chain of scalar computation to be retained so it's better to simply
    extract lane zero from the available vectorized shift operand.

    2020-06-25  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/95866
            * tree-vect-stmts.c (vectorizable_shift): Reject incompatible
            vectorized shift operands.  For scalar shifts use lane zero
            of a vectorized shift operand.

            * gcc.dg/vect/bb-slp-pr95866.c: New testcase.
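Roughly, the idea of the fix, illustrated with SSE2 intrinsics rather than
the actual vectorizer change (shift_and_add and vamt are made-up names): when
the vector { amt, amt, amt, amt } already exists for the add, take lane zero
from it for the vector-by-scalar shift instead of referencing the original
scalar SSA name, so no separate scalar chain has to stay live.

/* Sketch only, not the GCC change itself.  */
#include <emmintrin.h>

static __m128i
shift_and_add (__m128i v, __m128i vamt)
{
  int amt0 = _mm_cvtsi128_si32 (vamt);              /* extract lane zero */
  v = _mm_sll_epi32 (v, _mm_cvtsi32_si128 (amt0));  /* vector << scalar */
  return _mm_add_epi32 (v, vamt);                   /* add reuses the vector */
}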
Now fixed.