This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH], PR target/80718, Improve PowerPC splat double word
- From: Richard Sandiford <rdsandiford at googlemail dot com>
- To: Michael Meissner <meissner at linux dot vnet dot ibm dot com>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Segher Boessenkool <segher at kernel dot crashing dot org>, David Edelsohn <dje dot gcc at gmail dot com>
- Date: Tue, 23 May 2017 15:17:25 +0100
- Subject: Re: [PATCH], PR target/80718, Improve PowerPC splat double word
- Authentication-results: sourceware.org; auth=none
- References: <20170522183244.GA22334@ibm-tiger.the-meissners.org>
Michael Meissner <meissner@linux.vnet.ibm.com> writes:
> When I was comparing SPEC 2006 numbers between GCC 6.3 and 7.1, there was one
> benchmark that was noticeably slower (milc). In looking at the code generated,
> the #1 hot function (mult_adj_su3_mat_vec) had some cases where automatic
> vectorization generated a splat of a double loaded from memory.
>
> The register allocator did not use the load with splat instruction (LXVDSX)
> because all of the loads were register+offset. For the scalar values that it
> could load into the FPR registers, it used the normal register+offset load
> (d-form). For the other scalar values that would wind up in the traditional
> AltiVec registers, the register allocator decided to load the value into a
> GPR and do a direct move.
>
> Now, it turns out that while the above code is inefficient, it is not the
> cause of the slowdown in the milc benchmark. However, there might be other
> places where a load, direct move, and double-word permute sequence causes a
> performance problem, so I made this patch.
>
> The patch splits the splat into a register splat and a memory splat. This
> forces the register allocator to convert the load to the indexed form that the
> LXVDSX instruction uses. I did a SPEC 2006 run with these changes, and there
> were no significant performance differences with this patch.
>
> In the mult_adj_su3_mat_vec function, there were previously five GPR load,
> direct move, and permute sequences along with one LXVDSX. With this patch,
> those GPR loads have been replaced with LXVDSXs.
It sounds like that might create the opposite problem, in that if the RA
ends up having to spill the input operand of a register splat, it'll be
forced to keep the instruction as a register splat and load the spill
slot into a temporary register.
I thought the advice was normally to do what the port did before the
patch and present the register and memory versions as alternatives of a
single pattern. If there's no current way of getting efficient code in
that case, then maybe we need to tweak some of the costing infrastructure...
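The single-pattern approach Richard describes would look roughly like the following machine-description sketch, with the register splat and the memory splat as alternatives of one define_insn (the pattern name, predicates, and constraints here are illustrative in the style of rs6000.md, not the actual pattern from the patch):

```
;; Hypothetical sketch: one pattern, two alternatives.
;; Alternative 0: splat a value already in a VSX register (XXPERMDI).
;; Alternative 1: splat directly from indexed memory (LXVDSX).
(define_insn "vsx_splat_v2df_sketch"
  [(set (match_operand:V2DF 0 "vsx_register_operand" "=wa,wa")
        (vec_duplicate:V2DF
          (match_operand:DF 1 "input_operand" "wa,Z")))]
  "VECTOR_MEM_VSX_P (V2DFmode)"
  "@
   xxpermdi %x0,%x1,%x1,0
   lxvdsx %x0,%y1"
  [(set_attr "type" "vecperm,vecload")])
```

With both alternatives in one pattern, the register allocator can fall back to the register form if it spills operand 1, rather than being locked into whichever of the two split patterns was chosen earlier.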
Thanks,
Richard