Costing has

  /* Without sse4.1, we don't have PMULLD; it's emulated with 7
     insns, including two PMULUDQ.  */
  else if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
    return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5);

but for a testcase doing just x * 2 that is excessive. The vectorizer would change that to x << 1 via vect_recog_mult_pattern (yeah, oddly not to x + x ...).

This causes SSE vectorization to be disregarded easily, falling back to MMX "emulation" mode, which doesn't claim V4SImode multiplication support, producing essentially SSE code but with only half of the lanes doing useful work.

I'm not sure if pattern recog should try costing here. Certainly the vectorizer won't try the PMULUDQ variant if the backend would claim to not support V4SImode mult.

Noticed for the testcase in PR92173.
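For illustration only, here is a minimal sketch of the kind of loop in question and the two cheaper forms mentioned above. It is not the PR92173 testcase; the function names are hypothetical, and unsigned elements are assumed so that all three forms are well defined on wraparound.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not taken from PR92173.  Unsigned lanes are an
   assumption made here so that overflow wraps instead of being undefined. */

/* The multiplication as written in the source. */
void mul_by_two(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* The shift form that the mult pattern is reported to produce. */
void shift_by_one(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] << 1;
}

/* The addition form one might have expected instead. */
void add_to_self(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}
```

All three compute the same values, so each of the latter two needs only a single simple vector operation per vector, compared with the 7-insn emulation that the cost above assumes.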
Jakub, you did the mult pattern recog - any opinions? (also why do I see a << 1 instead of a + a?)
Something should compare the costs.

Either vect_recog_mult_pattern should move the mul_optab != unknown_optab etc. check after vect_synth_mult_by_constant, compare the costs of the pattern-recognized sequence vs. the multiplication, and, if vector multiplication is beneficial, undo whatever vect_synth_mult_by_constant added. Or the cost function for vector multiplication should special-case multiplication by a constant, and expansion of vector multiplication should do the same plus compare costs. I bet the first option would be easier.

As for v << 1 vs. v + v, there is already synth_lshift_by_additions, so we could force using it for LSHIFT_EXPR by 1 even for !synth_shift_p (would that be unconditionally a win?).

OT, the indentation introduced in r238340 has quite a lot of issues; many function calls have misindented arguments.
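To make the cost comparison in the first option concrete, here is a toy model in plain C. It is only a sketch, not vectorizer code: the struct, the function names and the cost values are all hypothetical, the only thing taken from this report is the formula mulss * 2 + sse_op * 5, and any scaling done by ix86_vec_cost is ignored.

```c
/* Toy cost table.  The field names mirror the snippet quoted in the
   report; the values a caller fills in are made up. */
struct toy_costs {
    int mulss;   /* assumed cost of one multiply insn */
    int sse_op;  /* assumed cost of one simple vector insn */
};

/* Cost formula from the quoted snippet for V4SImode multiplication
   without SSE4.1 (scaling by ix86_vec_cost is left out here). */
int emulated_v4si_mult_cost(const struct toy_costs *c)
{
    return c->mulss * 2 + c->sse_op * 5;
}

/* Assumed cost of a synthesized sequence made of n_ops simple vector
   operations (shifts, adds, subtracts). */
int synth_cost(const struct toy_costs *c, int n_ops)
{
    return c->sse_op * n_ops;
}

/* Returns nonzero when the synthesized sequence is cheaper and should
   be kept; zero means the plain multiplication would be preferred and
   the synthesized statements would be undone. */
int keep_synth_sequence(const struct toy_costs *c, int n_ops)
{
    return synth_cost(c, n_ops) < emulated_v4si_mult_cost(c);
}
```

With made-up values mulss = 4 and sse_op = 1, the emulated multiply costs 13, so a one-operation sequence such as x << 1 is kept, while a sequence of 13 or more simple operations would lose to the multiplication.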