[Bug target/68655] SSE2 cannot vec_perm of low and high part

rguenther at suse dot de gcc-bugzilla@gcc.gnu.org
Thu Dec 3 12:53:00 GMT 2015


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655

--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 3 Dec 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
> 
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> I guess it needs analysis.
> Some examples of changes:
> vshuf-v16qi.c -msse2 test_2, scalar code vs. punpcklqdq, clear win
> vshuf-v16qi.c -msse4 test_2, pshufb -> punpcklqdq (is this a win or not?)
> (similarly for -mavx, -mavx2, -mavx512f, -mavx512bw)
> vshuf-v16si.c -mavx512{f,bw} test_2:
> -       vpermi2d        %zmm1, %zmm1, %zmm0
> +       vmovdqa64       .LC2(%rip), %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> looks like a pessimization.
> vshuf-v32hi.c -mavx512bw test_2, a similar pessimization:
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> +       vmovdqa64       .LC2(%rip), %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> vshuf-v4si.c -msse2 test_183, another pessimization:
> -       pshufd  $78, %xmm0, %xmm1
> +       movdqa  %xmm0, %xmm1
>         movd    b(%rip), %xmm4
>         pshufd  $255, %xmm0, %xmm2
> +       shufpd  $1, %xmm0, %xmm1
> vshuf-v4si.c -msse4 test_183, another pessimization:
> -       pshufd  $78, %xmm1, %xmm0
> +       movdqa  %xmm1, %xmm0
> +       palignr $8, %xmm0, %xmm0
> vshuf-v4si.c -mavx test_183:
> -       vpshufd $78, %xmm1, %xmm0
> +       vpalignr        $8, %xmm1, %xmm1, %xmm0
> vshuf-v64qi.c -mavx512bw, a desirable change:
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> -       vpshufb .LC3(%rip), %zmm0, %zmm1
> -       vpshufb .LC4(%rip), %zmm0, %zmm0
> -       vporq   %zmm0, %zmm1, %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> vshuf-v8hi.c -msse2 test_1, another scalar-to-punpcklqdq change, a win
> vshuf-v8hi.c -msse4 test_2 (supposedly a win):
> -       pshufb  .LC3(%rip), %xmm0
> +       punpcklqdq      %xmm0, %xmm0
> vshuf-v8hi.c -mavx test_2, similarly:
> -       vpshufb .LC3(%rip), %xmm0, %xmm0
> +       vpunpcklqdq     %xmm0, %xmm0, %xmm0
> vshuf-v8si.c -mavx2 test_2, another win:
> -       vmovdqa a(%rip), %ymm0
> -       vperm2i128      $0, %ymm0, %ymm0, %ymm0
> +       vpermq  $68, a(%rip), %ymm0
> vshuf-v8si.c -mavx2 test_5, another win:
> -       vmovdqa .LC6(%rip), %ymm0
> -       vmovdqa .LC7(%rip), %ymm1
> -       vmovdqa %ymm0, -48(%rbp)
>         vmovdqa a(%rip), %ymm0
> -       vpermd  %ymm0, %ymm1, %ymm1
> -       vpshufb .LC8(%rip), %ymm0, %ymm3
> -       vpshufb .LC10(%rip), %ymm0, %ymm0
> -       vmovdqa %ymm1, c(%rip)
> -       vmovdqa b(%rip), %ymm1
> -       vpermq  $78, %ymm3, %ymm3
> -       vpshufb .LC9(%rip), %ymm1, %ymm2
> -       vpshufb .LC11(%rip), %ymm1, %ymm1
> -       vpor    %ymm3, %ymm0, %ymm0
> -       vpermq  $78, %ymm2, %ymm2
> -       vpor    %ymm2, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vmovdqa .LC7(%rip), %ymm2
> +       vmovdqa .LC6(%rip), %ymm1
> +       vpermd  %ymm0, %ymm2, %ymm2
> +       vpermd  b(%rip), %ymm1, %ymm3
> +       vmovdqa %ymm1, -48(%rbp)
> +       vmovdqa %ymm2, c(%rip)
> +       vpermd  %ymm0, %ymm1, %ymm0
> +       vmovdqa .LC8(%rip), %ymm2
> +       vpand   %ymm2, %ymm1, %ymm1
> +       vpcmpeqd        %ymm2, %ymm1, %ymm1
> +       vpblendvb       %ymm1, %ymm3, %ymm0, %ymm0
> vshuf-v8si.c -mavx512f test_2, another win?
> -       vmovdqa a(%rip), %ymm0
> -       vperm2i128      $0, %ymm0, %ymm0, %ymm0
> +       vpermq  $68, a(%rip), %ymm0
> 
> The above does not list all changes; I've often ignored further changes
> in a file when, say, one change adds or removes a .LC* constant and
> everything else gets renumbered, and it sometimes omits cases where the
> same or similar change appears with multiple ISAs.  So the results are
> clearly mixed.
> 
> Perhaps I should just try doing this at the end of expand_vec_perm_1
> (i.e. if we most likely couldn't get a single insn normally, see if we
> would get one otherwise), and at the end of ix86_expand_vec_perm_const_1
> (as the fallback after all other sequences have been tried).

Yeah, I would have done it only if we fail to permute, not generally.
I think you need to stop at 16-byte boundaries (TImode) only for AVX256
and at 32-byte (OImode) for AVX512.  I'm not sure there are cases where
an "effective" DImode permute works with SImode but not with DImode;
say, { 4, 5, 6, 7, 0, 1, 2, 3 } in HImode can be done with either an
SImode { 2, 3, 0, 1 } or a DImode { 1, 0 } permute.
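
Here's a minimal standalone sketch of that pairing check (a hypothetical
helper, not the actual i386.c code): a selector over 2N narrow lanes is
expressible over N double-width lanes exactly when every even/odd pair
of lanes moves together and in order.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: if PERM over NELT narrow lanes keeps each
   even/odd pair of lanes together and in order, store the equivalent
   permutation over NELT/2 double-width lanes in WIDE and return
   true.  */
static bool
widen_perm (const unsigned char *perm, size_t nelt, unsigned char *wide)
{
  for (size_t i = 0; i < nelt; i += 2)
    {
      if ((perm[i] & 1) != 0 || perm[i + 1] != perm[i] + 1)
        return false;                /* Pair is split or reversed.  */
      wide[i / 2] = perm[i] / 2;
    }
  return true;
}

int
main (void)
{
  /* The HImode example above: { 4, 5, 6, 7, 0, 1, 2, 3 }.  */
  unsigned char himode[8] = { 4, 5, 6, 7, 0, 1, 2, 3 };
  unsigned char simode[4], dimode[2];

  if (widen_perm (himode, 8, simode)      /* -> { 2, 3, 0, 1 } */
      && widen_perm (simode, 4, dimode))  /* -> { 1, 0 } */
    printf ("DImode permute: { %d, %d }\n", dimode[0], dimode[1]);
  return 0;
}

Iterating the check until it fails finds the widest element mode a
selector supports; per the above, the iteration would be capped at
TImode granularity for AVX256 and OImode for AVX512.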

> It won't catch some beneficial one-insn-to-one-insn changes, though
> (e.g. where, in the original sequence, the insn needs a constant
> operand in memory).

True.  I fear that at some point we'll want a generator covering all
possible permutes using the permute patterns (the input would be the .md
file and a list of insns to consider - or maybe even autodetecting those).
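
A toy illustration of that idea (all names and the tiny pattern set here
are made up; a real generator would derive the one-insn selectors from
the .md patterns): breadth-first search over compositions of the
available single-insn lane selectors, recording the minimum insn count
for every reachable selector.

#include <stdio.h>
#include <string.h>

#define NELT 4   /* V4SI-sized toy example.  */

/* Made-up single-insn lane selectors standing in for .md patterns.  */
static const unsigned char ops[][NELT] = {
  { 1, 0, 3, 2 },   /* swap lanes within each DImode half */
  { 2, 3, 0, 1 },   /* swap the two DImode halves         */
  { 0, 0, 1, 1 },   /* an unpack-low-style duplication    */
};
#define NOPS (sizeof ops / sizeof ops[0])

/* Pack a 4-lane selector into one byte (2 bits per lane).  */
static unsigned
encode (const unsigned char *p)
{
  return p[0] | p[1] << 2 | p[2] << 4 | p[3] << 6;
}

int
main (void)
{
  int dist[256];
  unsigned char queue[256][NELT];
  int head = 0, tail = 0;

  memset (dist, -1, sizeof dist);

  /* Start the search from the identity selector.  */
  unsigned char id[NELT] = { 0, 1, 2, 3 };
  memcpy (queue[tail++], id, NELT);
  dist[encode (id)] = 0;

  while (head < tail)
    {
      unsigned char cur[NELT], next[NELT];
      memcpy (cur, queue[head++], NELT);
      for (unsigned k = 0; k < NOPS; k++)
        {
          /* Applying op K after CUR: lane i reads CUR's lane ops[k][i].  */
          for (int i = 0; i < NELT; i++)
            next[i] = cur[ops[k][i]];
          if (dist[encode (next)] < 0)
            {
              dist[encode (next)] = dist[encode (cur)] + 1;
              memcpy (queue[tail++], next, NELT);
            }
        }
    }

  /* Minimum insn count for e.g. the full-reverse selector.  */
  unsigned char target[NELT] = { 3, 2, 1, 0 };
  printf ("cost: %d insns\n", dist[encode (target)]);
  return 0;
}

Two-operand selectors and 8/16/32-lane vectors blow up the state space,
but the search shape stays the same, which is why generating such a
table offline from the .md file is attractive.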

The code handling permutations is already quite unwieldy (and it tries
generating RTL ...) :/

