[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

roger at nextmovesoftware dot com gcc-bugzilla@gcc.gnu.org
Mon Jan 24 16:49:36 GMT 2022


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #22 from Roger Sayle <roger at nextmovesoftware dot com> ---
I completely agree with Richard that the decision to vectorize or not to
vectorize should be made elsewhere taking the whole function/loop into account.
 It's quite reasonable to synthesize a slow vector multiply if there's an
overall benefit from SLP.  What I think is required is that the "baseline" cost
should be the cost of moving from the vector to a scalar mode, performing the
multiplication(s) as a scalar and moving the result back again.  i.e. we're
assuming that we're always going to multiply the value in a vector register,
we're just choosing the cheapest implementation for it.  For the xxhash.i
testcase, I'm seeing DI mode multiplications with COSTS_N_INSNS(30) [i.e. a
mult_cost of 120]. Even with slow inter-unit moves it must be possible to do
this faster on AArch64?  In fact, we'll probably vectorize more in SLP if we
have the option to shuffle data back to the scalar multiplier when required.
Perhaps even a define_insn_and_split of mulv2di3 to fool the middle-end into
thinking we can do this "natively" via an optab.
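The lane-shuffling fallback can be sketched in portable C (the type and
function names below are illustrative stand-ins, not GCC's actual optab
expansion):

```c
#include <stdint.h>

/* Illustrative stand-in for a V2DImode vector; not GCC's representation.  */
typedef struct { uint64_t lane[2]; } v2di;

/* Hypothetical "baseline" implementation of mulv2di3: move each lane
   from the vector to a scalar register, do the DImode multiply there,
   and move the products back.  The per-lane moves are exactly what the
   baseline cost described above should charge for.  */
static v2di
mulv2di3_via_scalar (v2di a, v2di b)
{
  v2di r;
  r.lane[0] = a.lane[0] * b.lane[0];
  r.lane[1] = a.lane[1] * b.lane[1];
  return r;
}
```

A define_insn_and_split could expand to this shape when the target has no
native V2DI multiply, so the vectorizer sees a "native" optab whose cost is
two scalar multiplies plus the inter-unit moves.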

Note that multipliers used in cryptographic hash functions are sometimes
(chosen to be) pathological for synth_mult.  Like the design of DES's S-boxes,
these are coefficients designed to be slow to implement in software [and faster
in custom hardware]: 64-bit values with around 32 (random) bits set.
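The xxHash64 primes behind the xxhash.i testcase are concrete examples: each
has 36 of its 64 bits set, close to the half-full worst case for a
shift-and-add decomposition (constants below are the published XXH64 primes;
the helper name is illustrative):

```c
#include <stdint.h>

/* The first two 64-bit primes from xxHash64.  With roughly half the
   bits set, synth_mult has to consider very long shift/add/sub chains,
   which is where the recursion time goes.  */
static const uint64_t PRIME64_1 = 0x9E3779B185EBCA87ULL;
static const uint64_t PRIME64_2 = 0xC2B2AE3D27D4EB4FULL;

/* GCC builtin; counts the set bits of a 64-bit value.  */
static int
set_bits (uint64_t x)
{
  return __builtin_popcountll (x);
}
```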

I/we can try to speed up the recursion in synth_mult, and/or increase the size
of the hash-table cache [which will help hppa64 and other targets with slow
multipliers] but that's perhaps just working around the deeper issue with this
PR.
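On the caching angle, a toy model of synth_mult's recursion shows how even a
small direct-mapped cache tames the blow-up (the cost model here is a gross
simplification using only halving and +/-1 steps, and none of the names are
GCC's):

```c
#include <stdint.h>

/* Tiny direct-mapped stand-in for synth_mult's hash-table cache.
   All stored costs are >= 1, so cost == 0 marks an empty slot.  */
#define CACHE_SIZE 1021
static struct { uint64_t key; int cost; } cache[CACHE_SIZE];

/* Toy cost, in "insns", of multiplying by N using only shifts (halving)
   and add/subtract of the operand (the n-1 / n+1 branches).  */
static int
toy_mult_cost (uint64_t n)
{
  if (n <= 1)
    return 0;

  unsigned slot = (unsigned) (n % CACHE_SIZE);
  if (cache[slot].key == n && cache[slot].cost != 0)
    return cache[slot].cost;

  int c;
  if ((n & 1) == 0)
    c = toy_mult_cost (n >> 1) + 1;          /* shift */
  else
    {
      int down = toy_mult_cost (n - 1) + 1;  /* add */
      int up = toy_mult_cost (n + 1) + 1;    /* subtract */
      c = down < up ? down : up;
    }

  cache[slot].key = n;
  cache[slot].cost = c;
  return c;
}
```

Without the cache, odd values fork into two subproblems each, so constants
with many set bits make the recursion tree explode; with it, each value is
costed once.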

