This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/88013] can't vectorize rgb to grayscale conversion code
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 14 Nov 2018 08:46:25 +0000
- Subject: [Bug target/88013] can't vectorize rgb to grayscale conversion code
- Auto-submitted: auto-generated
- References: <bug-88013-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |arm
             Blocks|                            |53947
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 we manage to vectorize this with quite abysmal code (for core-avx2)
with a vectorization factor of 32:
.L4:
vmovdqu (%rax), %ymm1
vmovdqu 64(%rax), %ymm4
addq $32, %rcx
addq $96, %rax
vmovdqu -64(%rax), %ymm5
vpshufb %ymm14, %ymm1, %ymm0
vpermq $78, %ymm0, %ymm2
vpshufb %ymm13, %ymm1, %ymm0
vpshufb %ymm12, %ymm5, %ymm3
vpor %ymm2, %ymm0, %ymm0
vpshufb %ymm11, %ymm4, %ymm2
vpor %ymm3, %ymm0, %ymm0
vpermq $78, %ymm2, %ymm3
vpshufb .LC5(%rip), %ymm4, %ymm2
vpshufb .LC4(%rip), %ymm0, %ymm0
vpor %ymm3, %ymm2, %ymm2
vpshufb .LC6(%rip), %ymm1, %ymm3
vpermq $78, %ymm3, %ymm15
vpor %ymm2, %ymm0, %ymm0
vpshufb .LC7(%rip), %ymm1, %ymm3
vpshufb .LC8(%rip), %ymm5, %ymm2
vpor %ymm15, %ymm3, %ymm3
vpshufb .LC11(%rip), %ymm4, %ymm15
vpshufb .LC14(%rip), %ymm5, %ymm5
vpor %ymm2, %ymm3, %ymm3
vpshufb .LC9(%rip), %ymm4, %ymm2
vpermq $78, %ymm2, %ymm2
vpshufb %ymm10, %ymm3, %ymm3
vpor %ymm2, %ymm15, %ymm2
vpor %ymm2, %ymm3, %ymm3
vpshufb .LC12(%rip), %ymm1, %ymm2
vpshufb .LC13(%rip), %ymm1, %ymm1
vpermq $78, %ymm2, %ymm2
vpor %ymm2, %ymm1, %ymm2
vpshufb .LC15(%rip), %ymm4, %ymm1
vpshufb .LC16(%rip), %ymm4, %ymm4
vpermq $78, %ymm1, %ymm1
vpor %ymm5, %ymm2, %ymm2
vpor %ymm1, %ymm4, %ymm4
vpshufb %ymm10, %ymm2, %ymm2
vpmovzxbw %xmm0, %ymm1
vpor %ymm4, %ymm2, %ymm2
vpmovzxbw %xmm3, %ymm4
vextracti128 $0x1, %ymm0, %xmm0
vpmullw %ymm7, %ymm4, %ymm4
vpmullw %ymm8, %ymm1, %ymm1
vextracti128 $0x1, %ymm3, %xmm3
vpmovzxbw %xmm0, %ymm0
vpmovzxbw %xmm3, %ymm3
vpmullw %ymm8, %ymm0, %ymm0
vpmullw %ymm7, %ymm3, %ymm3
vpaddw %ymm4, %ymm1, %ymm1
vpmovzxbw %xmm2, %ymm4
vextracti128 $0x1, %ymm2, %xmm2
vpmovzxbw %xmm2, %ymm2
vpmullw %ymm6, %ymm4, %ymm4
vpmullw %ymm6, %ymm2, %ymm2
vpaddw %ymm3, %ymm0, %ymm0
vpaddw %ymm4, %ymm1, %ymm1
vpaddw %ymm2, %ymm0, %ymm0
vpsrlw $8, %ymm1, %ymm1
vpsrlw $8, %ymm0, %ymm0
vpand %ymm1, %ymm9, %ymm1
vpand %ymm0, %ymm9, %ymm0
vpackuswb %ymm0, %ymm1, %ymm0
vpermq $216, %ymm0, %ymm0
vmovdqu %ymm0, -32(%rcx)
cmpq %r8, %rcx
jne .L4
Maybe you can post what you think arm can do better here?
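
For reference, the loop above appears to consume 96 input bytes and produce 32 output bytes per iteration, which matches a scalar loop of roughly the following shape. This is only a sketch and not the testcase attached to the bug: the function name rgb_to_gray and the weights 77/150/29 are assumptions chosen for illustration (they sum to 256, so the weighted sum fits in 16 bits).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction, NOT the testcase from the bug report.
   The weights 77/150/29 are assumed for illustration; they sum to 256,
   so the weighted sum never exceeds 255 * 256 and fits in 16 bits. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Stride-3 loads: each output pixel reads three interleaved bytes. */
        uint16_t r = rgb[3 * i + 0];
        uint16_t g = rgb[3 * i + 1];
        uint16_t b = rgb[3 * i + 2];

        /* Widen to 16 bits, multiply by the weights, and add. */
        uint16_t sum = (uint16_t)(r * 77 + g * 150 + b * 29);

        /* Shift right by 8 and narrow back to one byte. */
        gray[i] = (uint8_t)(sum >> 8);
    }
}
```

Under that assumption, the vpshufb/vpermq/vpor sequences correspond to de-interleaving the stride-3 loads into separate channel vectors, vpmovzxbw to the widening, vpmullw and vpaddw to the weighted sum, and vpsrlw $8 followed by vpand and vpackuswb to the shift and narrowing store. The de-interleaving accounts for most of the instruction count.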
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations