[Bug target/65951] [AArch64] Will not vectorize 64bit integer multiplication

wilson at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue May 3 01:06:00 GMT 2016


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951

--- Comment #11 from Jim Wilson <wilson at gcc dot gnu.org> ---
I've spent some time looking at solutions to this problem.

One way to solve it is to simply add a mulv2di3 pattern to the aarch64 port. 
The presence of a multiply pattern means we will go through expand_mult, and we
get the shift/add sequence generation automatically.  The x86 port has a few
patterns like this.  mulv2di3 can be implemented with 3 v4si multiplies, and 3
zip instructions, plus 2 adds that fold into multiply-add, and a shift.  The
downside of this solution is that if we aren't multiplying by a constant, we
get this long mulv2di3 sequence.  On an APM/Mustang, for a loop that does only
a multiply, this sequence is much slower than 2 integer DImode multiplies, so
it is better not to vectorize in this case.  This is probably not a win in
general.
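To see why three v4si multiplies suffice, here is a scalar C model of one
64-bit lane of that expansion (a sketch of the arithmetic only, not the actual
machine-description pattern): the low 64 bits of a*b need only the lo*lo
product and the two 32x32 cross products, since hi*hi drops out mod 2^64.

```c
#include <stdint.h>

/* One 64-bit lane of the proposed mulv2di3 expansion, modeled in
   scalar C.  a*b mod 2^64 = alo*blo + ((alo*bhi + ahi*blo) << 32);
   the ahi*bhi partial product contributes nothing mod 2^64.  */
static uint64_t mul64_via_32 (uint64_t a, uint64_t b)
{
  uint64_t alo = (uint32_t) a, ahi = a >> 32;
  uint64_t blo = (uint32_t) b, bhi = b >> 32;
  return alo * blo + ((alo * bhi + ahi * blo) << 32);
}
```

In the vector version the two cross products fold into multiply-adds, and the
zip instructions shuffle the 32-bit halves into position for the v4si
multiplies.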

Another way to solve it is to use the existing synth_mult code in expmed.c.  We
can easily share the code that generates the algorithm in choose_mult_variant,
but expand_mult_const is very rtl specific, so I copied that part with a lot of
modification to generate gimple instead.  Again, testing on APM/Mustang, for a
loop that only does multiply, I found that a 2 instruction shift/add sequence
is a win, but a 3 instruction shift/add sequence is a loss.  Since we already handle the
1 instruction case trivially, this appears to be a lot of work for not much
gain.  This is probably a better solution than the above one if the amount of
new code is OK.  This patch needs a bit more work to finish it, and will likely
need aarch64 rtx costs adjusted so that we get the best result for all targets.
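For illustration, here is a toy recursive version of the synth_mult idea:
reduce multiply-by-constant to shifts, adds and subtracts.  This is a
hypothetical sketch; GCC's real synth_mult does a cost-driven search over many
more variants rather than this simple greedy recursion.

```c
#include <stdint.h>

/* Toy shift/add synthesis for x * c (mod 2^64), in the spirit of
   synth_mult: strip factors of two with a shift, and turn an odd
   constant even by adding or subtracting one copy of x.  Choosing
   c+1 when the low two bits are 11 shortens runs of set bits.  */
static uint64_t mul_by_const (uint64_t x, uint64_t c)
{
  if (c == 0)
    return 0;
  if (c == 1)
    return x;
  if ((c & 1) == 0)
    {
      int s = __builtin_ctzll (c);
      return mul_by_const (x, c >> s) << s;  /* factor out 2^s */
    }
  if ((c & 3) == 3)
    return mul_by_const (x, c + 1) - x;      /* c = (c+1) - 1 */
  return mul_by_const (x, c - 1) + x;        /* c = (c-1) + 1 */
}
```

Each recursion step corresponds to one shift, add-with-shift, or
subtract-with-shift instruction in the emitted sequence, which is what the
2-instruction vs. 3-instruction cutoff above is counting.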

There may be wins in cases where a loop does more than a simple multiply, and
the loop is not vectorized only because of the multiply.  It isn't clear how to
quantify that.

For the original testcase, the constant 19594 requires 9 operations.  We can do
that with 9 vector instructions, or 5 integer instructions:
        add     x0, x1, x1, lsl 3
        add     x6, x0, x0, lsl 4
        add     x7, x1, x6, lsl 4
        add     x8, x1, x7, lsl 2
        lsl     x9, x8, 1
Since 5 integer instructions are likely more than twice as fast as 9 vector
operations on all aarch64 parts, we still won't vectorize this loop even if we
have synth_mult support in the vectorizer.  We still get an integer multiply
instruction, as that is faster than the 5 integer shift/add instructions, which
in turn are faster than the 9 vector shift/add instructions and the vector
multiply via 3 v4si multiplies.
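The 5-instruction sequence above transcribes directly to C, since each aarch64
add with a shifted operand computes x + (y << n); the chain builds 9, 153,
2449, 9797, and finally 19594 times the input:

```c
#include <stdint.h>

/* C transcription of the shift/add sequence for multiplying by 19594,
   one line per instruction, keeping the register names as variables.  */
static uint64_t mul19594 (uint64_t x1)
{
  uint64_t x0 = x1 + (x1 << 3);  /* 9    * x1 */
  uint64_t x6 = x0 + (x0 << 4);  /* 153  * x1 */
  uint64_t x7 = x1 + (x6 << 4);  /* 2449 * x1 */
  uint64_t x8 = x1 + (x7 << 2);  /* 9797 * x1 */
  return x8 << 1;                /* 19594 * x1 */
}
```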

I've attached work in progress patches for the two solutions to the PR, along
with testcases I'm using to verify the patches.
