[Bug target/68655] SSE2 cannot vec_perm of low and high part
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Thu Dec 3 12:53:00 GMT 2015
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 3 Dec 2015, jakub at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
>
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> I guess it needs analysis.
> Some examples of changes:
> vshuf-v16qi.c -msse2 test_2, scalar code vs. punpcklqdq, clear win
> vshuf-v16qi.c -msse4 test_2, pshufb -> punpcklqdq (is this a win or not?)
> (similarly for -mavx, -mavx2, -mavx512f, -mavx512bw)
> vshuf-v16si.c -mavx512{f,bw} test_2:
> - vpermi2d %zmm1, %zmm1, %zmm0
> + vmovdqa64 .LC2(%rip), %zmm0
> + vpermi2q %zmm1, %zmm1, %zmm0
> looks like pessimization.
> vshuf-v32hi.c -mavx512bw test_2, similar pessimization:
> - vpermi2w %zmm1, %zmm1, %zmm0
> + vmovdqa64 .LC2(%rip), %zmm0
> + vpermi2q %zmm1, %zmm1, %zmm0
> vshuf-v4si.c -msse2 test_183, another pessimization:
> - pshufd $78, %xmm0, %xmm1
> + movdqa %xmm0, %xmm1
> movd b(%rip), %xmm4
> pshufd $255, %xmm0, %xmm2
> + shufpd $1, %xmm0, %xmm1
> vshuf-v4si.c -msse4 test_183, another pessimization:
> - pshufd $78, %xmm1, %xmm0
> + movdqa %xmm1, %xmm0
> + palignr $8, %xmm0, %xmm0
> vshuf-v4si.c -mavx test_183:
> - vpshufd $78, %xmm1, %xmm0
> + vpalignr $8, %xmm1, %xmm1, %xmm0
> vshuf-v64qi.c -mavx512bw, desirable change:
> - vpermi2w %zmm1, %zmm1, %zmm0
> - vpshufb .LC3(%rip), %zmm0, %zmm1
> - vpshufb .LC4(%rip), %zmm0, %zmm0
> - vporq %zmm0, %zmm1, %zmm0
> + vpermi2q %zmm1, %zmm1, %zmm0
> vshuf-v8hi.c -msse2 test_1 another scalar to punpcklqdq, win
> vshuf-v8hi.c -msse4 test_2 (supposedly a win):
> - pshufb .LC3(%rip), %xmm0
> + punpcklqdq %xmm0, %xmm0
> vshuf-v8hi.c -mavx test_2, similarly:
> - vpshufb .LC3(%rip), %xmm0, %xmm0
> + vpunpcklqdq %xmm0, %xmm0, %xmm0
> vshuf-v8si.c -mavx2 test_2, another win:
> - vmovdqa a(%rip), %ymm0
> - vperm2i128 $0, %ymm0, %ymm0, %ymm0
> + vpermq $68, a(%rip), %ymm0
> vshuf-v8si.c -mavx2 test_5, another win:
> - vmovdqa .LC6(%rip), %ymm0
> - vmovdqa .LC7(%rip), %ymm1
> - vmovdqa %ymm0, -48(%rbp)
> vmovdqa a(%rip), %ymm0
> - vpermd %ymm0, %ymm1, %ymm1
> - vpshufb .LC8(%rip), %ymm0, %ymm3
> - vpshufb .LC10(%rip), %ymm0, %ymm0
> - vmovdqa %ymm1, c(%rip)
> - vmovdqa b(%rip), %ymm1
> - vpermq $78, %ymm3, %ymm3
> - vpshufb .LC9(%rip), %ymm1, %ymm2
> - vpshufb .LC11(%rip), %ymm1, %ymm1
> - vpor %ymm3, %ymm0, %ymm0
> - vpermq $78, %ymm2, %ymm2
> - vpor %ymm2, %ymm1, %ymm1
> - vpor %ymm1, %ymm0, %ymm0
> + vmovdqa .LC7(%rip), %ymm2
> + vmovdqa .LC6(%rip), %ymm1
> + vpermd %ymm0, %ymm2, %ymm2
> + vpermd b(%rip), %ymm1, %ymm3
> + vmovdqa %ymm1, -48(%rbp)
> + vmovdqa %ymm2, c(%rip)
> + vpermd %ymm0, %ymm1, %ymm0
> + vmovdqa .LC8(%rip), %ymm2
> + vpand %ymm2, %ymm1, %ymm1
> + vpcmpeqd %ymm2, %ymm1, %ymm1
> + vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
> vshuf-v8si.c -mavx512f test_2, another win?
> - vmovdqa a(%rip), %ymm0
> - vperm2i128 $0, %ymm0, %ymm0, %ymm0
> + vpermq $68, a(%rip), %ymm0
>
> The above does not list all changes; I've often ignored further changes
> in a file if, say, one change adds or removes a .LC*, because then everything
> else is renumbered (and I haven't always listed cases where the same or
> similar change appears with multiple ISAs). So the results are clearly mixed.
>
> Perhaps I should just try doing this at the end of expand_vec_perm_1 (i.e. if
> we (most likely) couldn't get a single insn normally, see if we would get it
> otherwise), and at the end of ix86_expand_vec_perm_const_1 (as the fallback
> after all sequences).
Yeah, I would have done it only if we fail to permute, not generally.
I think you need to stop at 16-byte boundaries (TImode) only for AVX256
and 32-byte (OImode) for AVX512. Not sure if there are cases where
an "effective" DImode permute works with SImode but not DImode;
say { 4, 5, 6, 7, 0, 1, 2, 3 } HImode can be done with either an
SImode { 2, 3, 0, 1 } or a DImode { 1, 0 } permute.
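The widening check described above can be sketched outside GCC (purely illustrative, not the actual i386 backend code): a narrow-element permutation is expressible at a wider element size exactly when each group of narrow indices covers one whole wide element.

```python
# Sketch: decide whether a permutation given in narrow-element indices
# also works at a wider element size (factor = narrow elems per wide elem).
def widen_perm(perm, factor):
    """Return the wide-element permutation, or None if the narrow
    permutation splits some wide element apart."""
    if len(perm) % factor:
        return None
    wide = []
    for i in range(0, len(perm), factor):
        group = perm[i:i + factor]
        # The group must be `factor` consecutive indices starting at a
        # wide-element boundary.
        if group[0] % factor or group != list(range(group[0], group[0] + factor)):
            return None
        wide.append(group[0] // factor)
    return wide

# The HImode permute from the mail: { 4, 5, 6, 7, 0, 1, 2, 3 }
hi = [4, 5, 6, 7, 0, 1, 2, 3]
print(widen_perm(hi, 2))  # SImode view: [2, 3, 0, 1]
print(widen_perm(hi, 4))  # DImode view: [1, 0]
```

This recovers both widenings mentioned in the mail; a permute like { 1, 0, 2, 3 } HImode would return None for factor 2, since it swaps within a wide element.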
> It won't catch some beneficial one insn to one insn
> changes (e.g. where in the original case the insn needs a constant operand in
> memory) though.
True. I fear that at some point we want a generator covering all
possible permutes using permute patterns (input would be the .md
file and a list of insns to consider - or maybe even autodetect those).
The code handling permutations is already quite unwieldy (and it tries
generating RTL ...) :/
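The generator idea above could look something like the following hypothetical sketch (not GCC code, and the pattern catalogue here is made up): given a set of single-instruction permute patterns, breadth-first search over their compositions for the shortest sequence realizing a target permutation.

```python
from collections import deque

def compose(p, q):
    # Apply permutation q after p: result[i] = p[q[i]].
    return tuple(p[i] for i in q)

def shortest_sequence(target, patterns, max_len=3):
    """patterns: dict mapping a pattern name to a permutation tuple.
    Returns a shortest list of pattern names whose composition equals
    target, or None if none exists within max_len steps."""
    n = len(target)
    identity = tuple(range(n))
    seen = {identity: []}
    queue = deque([identity])
    while queue:
        cur = queue.popleft()
        if cur == tuple(target):
            return seen[cur]
        if len(seen[cur]) >= max_len:
            continue
        for name, pat in patterns.items():
            nxt = compose(cur, pat)
            if nxt not in seen:
                seen[nxt] = seen[cur] + [name]
                queue.append(nxt)
    return None
```

For a real generator the catalogue would be derived from the .md file per the suggestion above; BFS guarantees the sequence found is minimal in instruction count (though not necessarily in latency or constant-pool loads, which the mixed results above show also matter).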