This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/81673] Harmful SLP vectorization
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 03 Aug 2017 07:56:03 +0000
- Subject: [Bug tree-optimization/81673] Harmful SLP vectorization
- Auto-submitted: auto-generated
- References: <bug-81673-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81673
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #3)
> (In reply to Andrew Pinski from comment #1)
> > What happens if you use -march=intel?
>
> With -mtune=intel, the lower half of the vector is moved directly
That's what the change tries to account for -- RA is unlikely to be
able to allocate a %xmm for the lower half.
> whereas the upper one is still done through the stack:
That one is not accounted for, but it's still one insert. So the patch
fixes the original cost model's assumption that the first "insert" isn't
needed because the value is already in an %xmm.
> .cfi_startproc
> leaq -56(%rsp), %rsp
> .cfi_def_cfa_offset 64
> movq %rdx, %xmm0
> movq %rcx, (%rsp)
> leaq 16(%rsp), %rdi
> movq %r9, 8(%rsp)
> movhps (%rsp), %xmm0
So with -mavx this can be a vpinsrq, which supports inserting from GPRs.
I wonder how fugly this insertion code gets for HImode inserts? AFAIK
there are no HImode loads to %xmm.
Anyway, precise cost modeling is difficult without factoring out a
(pessimistic) costing routine from the vec_init expander. After all we do
not know where those constructor components come from -- they might come
from a load (in case of strided SLP or strided loads) in which case the
story is different.
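
To make the distinction concrete, here is a minimal C sketch of the two
situations. The struct and function names are made up for illustration
and are not the testcase from this bug; whether GCC SLP vectorizes either
function depends on the version, target and flags.

```c
/* Hypothetical illustration only: names are not from the bug report. */
struct quad { long a, b, c, d; };

/* Stand-in consumer so the struct has a use. */
long sum_quad(const struct quad *q)
{
    return q->a + q->b + q->c + q->d;
}

/* Components arrive as integer arguments, i.e. in GPRs on x86-64
   (SysV ABI).  Building a vector from them needs GPR -> %xmm moves
   or inserts for every element.  */
long from_args(long a, long b, long c, long d)
{
    struct quad q = { a, b, c, d };
    return sum_quad(&q);
}

/* Components come from strided loads.  Here the elements start out in
   memory, so the constructor cost may look different.  */
long from_strided(const long *p, long stride)
{
    struct quad q = { p[0], p[stride], p[2 * stride], p[3 * stride] };
    return sum_quad(&q);
}
```

Both functions compute the same kind of result; the only difference is
where the constructor components live before the struct is filled in.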
> movdqa %xmm0, 32(%rsp)
> movq %r8, %xmm0
> movhps 8(%rsp), %xmm0
> movdqa %xmm0, 16(%rsp)
> call bar
> leaq 56(%rsp), %rsp
> .cfi_def_cfa_offset 8
> ret
> .cfi_endproc
>
> ...so I guess this would still incur some penalty on the benchmark,
> but I am not sure.
Adding 1 to the constructor cost should turn the tide towards not SLP
vectorizing in the 2-component integer vector case.
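
A rough sketch of that reasoning in C, under stated assumptions: the
function names, the unit insert cost and the "vector cost strictly below
scalar cost" rule are all hypothetical simplifications and not GCC's
actual cost hooks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical unit cost of one element insert. */
enum { INSERT_COST = 1 };

/* Optimistic model: the first element is assumed to be in a vector
   register already, so only nelts - 1 inserts are charged.  */
int ctor_cost_optimistic(int nelts)
{
    assert(nelts >= 1);
    return (nelts - 1) * INSERT_COST;
}

/* Pessimistic model: every element, including the first, has to be
   moved or inserted into the vector register (the "adding 1").  */
int ctor_cost_pessimistic(int nelts)
{
    assert(nelts >= 1);
    return nelts * INSERT_COST;
}

/* Simplified decision: vectorize only if strictly cheaper. */
bool slp_profitable(int scalar_cost, int vector_cost)
{
    return vector_cost < scalar_cost;
}
```

With made-up numbers (scalar cost 4, remaining vector cost 2) the
2-element case is profitable under the optimistic model (2 + 1 < 4) and
no longer profitable under the pessimistic one (2 + 2 is not below 4).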