This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 16 Aug 2016 07:43:10 +0000
- Subject: [Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers
- References: <bug-74585-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #11)
> With the original test case, -mcpu=power8 is problematic because of the use
> of the "swapping stores," whose RHS is a vec_select rather than a register
> or subreg. This prevents us from saving the RHS of the store for use in
> replacing subsequent loads, running afoul of this logic in
> dse.c:record_store ():
>
>   if (GET_CODE (body) == SET
>       /* No place to keep the value after ra. */
>       && !reload_completed
>       && (REG_P (SET_SRC (body))     <= this part
>           || GET_CODE (SET_SRC (body)) == SUBREG
>           || CONSTANT_P (SET_SRC (body)))
>       && !MEM_VOLATILE_P (mem)
>       /* Sometimes the store and reload is used for truncation and
>          rounding. */
>       && !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))
>
> We can circumvent this if we can use stvx to force the parameters to the
> stack, which is legal since the stack slots are properly aligned.
>
> However, even using -mcpu=power9, we don't handle removing the stores and
> replacing the partial loads with register logic.
You mean stores like the following?
(insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 112 [0x70])) [1 a+48 S16 A128])
        (vec_select:V4SI (reg:V4SI 190)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
I wonder why DSE can't simply force the RHS to a register?  Of course, if
POWER really has stores that do this vec_select but no non-store instruction
that performs the same operation, then this might not be valid ...
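
To make the vec_select above concrete, here is a minimal scalar model in C.
It is only a sketch of what the RTL expression computes, not GCC code: the
function names and the register contents are made up for illustration.  The
[2 3 0 1] selector swaps the two doublewords, so applying it twice gives back
the original value.

```c
#include <assert.h>
#include <string.h>

/* Scalar model of (vec_select:V4SI src (parallel [sel...])):
   element i of the result is element sel[i] of the source.  */
void vec_select_v4si(const int src[4], const int sel[4], int dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[sel[i]];
}

/* Hypothetical self-check: the [2 3 0 1] selector swaps the two
   doublewords, so applying it twice restores the original value.  */
void check_swap_roundtrip(void)
{
    const int sel[4] = {2, 3, 0, 1};
    const int reg190[4] = {10, 11, 12, 13};  /* made-up register contents */
    int stored[4], reloaded[4];

    vec_select_v4si(reg190, sel, stored);    /* what the swapping store writes */
    vec_select_v4si(stored, sel, reloaded);  /* swapping it back */
    assert(memcmp(reg190, reloaded, sizeof reg190) == 0);
}
```

If the same permutation is available as a register-to-register operation, the
stored value could in principle be kept in a pseudo and reused by later loads;
whether that holds for the target is exactly the open question.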
Now, in the end this example just shows that lowering register passing
only at RTL expansion leads to a lot of missed optimizations in
parameter setup ... some scheme to apply the lowering on GIMPLE already
would be interesting to explore (albeit quite a bit of work).  We'd
have a second set of "parameter decls" somewhere, like in struct function,
and use that when the IL is in lowered form.  Same for DECL_RESULT of course.
And then the interesting part is whether to expose the stack in some way or
restrict the lowering to decomposition/combining to registers.
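
As a rough source-level analogy of the "decomposition to registers" variant
(a sketch only; the type and function names are hypothetical, and the real
lowering would happen on GIMPLE rather than in the source), compare a function
taking a homogeneous vector aggregate with one taking one vector per register:

```c
typedef int v4si __attribute__ ((vector_size (16)));  /* GCC vector extension */

/* Hypothetical homogeneous vector aggregate: four V4SI members.  */
struct hva { v4si v[4]; };

/* As the user writes it: the aggregate is a single parameter, and element
   accesses go through the (possibly spilled) aggregate.  */
int sum_lane0(struct hva a)
{
    return a.v[0][0] + a.v[1][0] + a.v[2][0] + a.v[3][0];
}

/* Analogy for the lowered form: one parameter per vector register, so
   each access refers directly to a register-like value.  */
int sum_lane0_lowered(v4si a0, v4si a1, v4si a2, v4si a3)
{
    return a0[0] + a1[0] + a2[0] + a3[0];
}
```

The second form is what the optimizers would ideally see early, with the
second set of parameter decls standing in for a0 ... a3.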