Bug 92175 - x86 backend claims V4SI multiplication support, preventing more optimal pattern
Summary: x86 backend claims V4SI multiplication support, preventing more optimal pattern
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2019-10-22 08:44 UTC by Richard Biener
Modified: 2020-01-29 14:06 UTC

See Also:
Target: x86_64-*-*, i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2020-01-29 00:00:00


Description Richard Biener 2019-10-22 08:44:49 UTC
Costing has

      /* Without sse4.1, we don't have PMULLD; it's emulated with 7
         insns, including two PMULUDQ.  */
      else if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
        return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5);

but for a testcase doing just x * 2 that is excessive.  The vectorizer
would change that to x << 1 via vect_recog_mult_pattern (yeah, oddly
not to x + x ...).
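
For illustration, the three forms discussed here can be written as scalar loops. This is a minimal sketch, not the testcase from PR92173; the function names are hypothetical. For unsigned 32-bit elements all three produce the same values, so which one is chosen is purely a costing question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop of the kind described: multiply every element by 2. */
void mul2(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* The shape the multiplication pattern is reported to produce: x << 1. */
void shl1(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] << 1;
}

/* The alternative mentioned in the report: x + x. */
void add_self(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}
```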

This causes SSE vectorization to be disregarded easily, falling back to
MMX "emulation" mode, which doesn't claim V4SImode multiplication support,
producing essentially SSE code but with only half of the lanes doing useful
work.

I'm not sure if pattern recog should try costing here.  Certainly the
vectorizer won't try the PMULUDQ variant if the backend would claim to
not support V4SImode mult.
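
As a rough illustration of what the PMULUDQ variant computes, here is a portable scalar model. It is a sketch under assumptions: the struct types and helper names are hypothetical, and it does not reproduce the exact 7-instruction sequence the backend emits. It only shows the idea of two widening even-lane multiplies plus data movement.

```c
#include <stdint.h>

/* Hypothetical models of a 128-bit vector viewed as 4x32 or 2x64 bits. */
typedef struct { uint32_t lane[4]; } v4si;
typedef struct { uint64_t lane[2]; } v2di;

/* Model of PMULUDQ: unsigned 32x32->64 multiply of lanes 0 and 2 only. */
static v2di pmuludq(v4si a, v4si b)
{
    v2di r;
    r.lane[0] = (uint64_t)a.lane[0] * b.lane[0];
    r.lane[1] = (uint64_t)a.lane[2] * b.lane[2];
    return r;
}

/* Model of the data movement that brings lanes 1 and 3 into the even
   positions (for example a 32-bit right shift of each 64-bit half). */
static v4si odd_to_even(v4si a)
{
    v4si r = { { a.lane[1], 0, a.lane[3], 0 } };
    return r;
}

/* Emulated four-lane 32-bit multiply: one PMULUDQ for the even lanes,
   one for the odd lanes, then keep the low 32 bits of each product and
   interleave them back into lane order. */
v4si mul_v4si_emulated(v4si a, v4si b)
{
    v2di even = pmuludq(a, b);
    v2di odd = pmuludq(odd_to_even(a), odd_to_even(b));
    v4si r = { { (uint32_t)even.lane[0], (uint32_t)odd.lane[0],
                 (uint32_t)even.lane[1], (uint32_t)odd.lane[1] } };
    return r;
}
```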

Noticed for the testcase in PR92173.
Comment 1 Richard Biener 2019-10-22 08:52:09 UTC
Jakub, you did the mult pattern recog - any opinions?  (also why do I see
a << 1 instead of a + a?)
Comment 2 Jakub Jelinek 2019-10-22 09:26:45 UTC
Something should compare the costs.  Either vect_recog_mult_pattern should move the mul_optab != unknown_optab etc. check after vect_synth_mult_by_constant, compare the cost of the pattern-recognized sequence against that of the multiplication, and, if the vector multiplication is cheaper, undo whatever vect_synth_mult_by_constant added.
Or the cost function for vector multiplication should special-case multiplication by a constant, and the expansion of vector multiplication should do the same and compare costs.
I bet the first option would be easier.
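
The decision in the first option could look roughly like the sketch below. Everything here is hypothetical: the struct, the function names, and the tie-breaking rule are illustrative assumptions and not GCC data structures or APIs. Only the formula in emulated_v4si_mult_cost mirrors the costing expression quoted in the description, with the ix86_vec_cost wrapper left out.

```c
#include <stdbool.h>

/* Hypothetical cost summary; not a GCC type. */
struct mult_choice {
    bool mult_supported; /* target claims a vector multiply for the mode */
    int mult_cost;       /* cost the target reports for that multiply */
    int synth_cost;      /* cost of the synthesized shift/add sequence */
};

/* Mirrors the quoted expression: cost->mulss * 2 + cost->sse_op * 5. */
int emulated_v4si_mult_cost(int mulss, int sse_op)
{
    return mulss * 2 + sse_op * 5;
}

/* Keep the synthesized sequence when no multiply exists or when the
   sequence is strictly cheaper; on a tie, keep the multiply (an
   arbitrary choice made for this sketch). */
bool prefer_synthesized(const struct mult_choice *c)
{
    if (!c->mult_supported)
        return true;
    return c->synth_cost < c->mult_cost;
}
```

With made-up unit costs such as mulss = 3 and sse_op = 1, the emulated multiply costs 11 while a single shift costs 1, so a comparison of this kind would keep x << 1 even though the target claims multiplication support.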
As for v << 1 vs. v + v, there is already synth_lshift_by_additions, so we could force using it for LSHIFT_EXPR by 1 even for !synth_shift_p (would that be unconditionally a win?).
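
The idea behind replacing a left shift with additions can be modelled in scalar form as repeated doubling. This is only a sketch of the arithmetic, not the code of synth_lshift_by_additions; the function name below is hypothetical.

```c
#include <stdint.h>

/* Scalar model: x << n computed as n doublings (x += x).  For n == 1
   this is exactly v + v.  Unsigned arithmetic wraps, so the result
   matches x << n for every n < 32. */
uint32_t lshift_by_additions(uint32_t x, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        x += x;
    return x;
}
```

The number of additions grows with the shift count, which is why forcing this form looks most plausible for a shift by 1.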
OT, the indentation introduced in r238340 has quite a lot of issues; many function calls have misindented arguments.