[Bug rtl-optimization/77287] Much worse code generated compared to clang (stack alignment and spills)
kobalicek.petr at gmail dot com
gcc-bugzilla@gcc.gnu.org
Sat Aug 20 18:10:00 GMT 2016
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287
--- Comment #4 from Petr <kobalicek.petr at gmail dot com> ---
Adding -fschedule-insns is definitely a huge improvement in this case. I wonder
why it isn't enabled by default at -O2 and -Os, since it really improves things
and produces shorter output. Or is that only true in this particular case?
Here is the assembly produced by gcc with -fschedule-insns:
push ebp
mov ebp, esp
and esp, -32
lea esp, [esp-32]
mov ecx, DWORD PTR [ebp+8]
mov edx, DWORD PTR [ebp+32]
mov eax, DWORD PTR [ebp+36]
vmovdqu ymm5, YMMWORD PTR [ecx]
mov ecx, DWORD PTR [ebp+12]
vmovdqu ymm3, YMMWORD PTR [edx]
vmovdqu ymm6, YMMWORD PTR [eax]
vmovdqu ymm2, YMMWORD PTR [ecx]
mov ecx, DWORD PTR [ebp+28]
vpackuswb ymm7, ymm2, ymm3
vpaddw ymm7, ymm7, ymm2
vpsubw ymm7, ymm7, ymm3
vmovdqu ymm4, YMMWORD PTR [ecx]
mov ecx, DWORD PTR [ebp+16]
vpackuswb ymm0, ymm5, ymm4
vpaddw ymm0, ymm0, ymm5
vpsubw ymm0, ymm0, ymm4
vmovdqu ymm1, YMMWORD PTR [ecx]
vpackuswb ymm0, ymm0, ymm7
mov ecx, DWORD PTR [ebp+20]
vpackuswb ymm2, ymm1, ymm6
vmovdqu ymm4, YMMWORD PTR [edx+32]
vpaddw ymm1, ymm2, ymm1
mov edx, DWORD PTR [ebp+24]
vpsubw ymm1, ymm1, ymm6
vmovdqu ymm5, YMMWORD PTR [ecx]
vpackuswb ymm0, ymm0, ymm1
vpackuswb ymm3, ymm5, ymm4
vmovdqa YMMWORD PTR [esp], ymm3
vmovdqu ymm2, YMMWORD PTR [eax+32] ; LOOK HERE
vpaddw ymm5, ymm5, YMMWORD PTR [esp]
vmovdqu ymm3, YMMWORD PTR [edx] ; AND HERE
vpsubw ymm4, ymm5, ymm4
vpackuswb ymm7, ymm3, ymm2
vpackuswb ymm0, ymm0, ymm4
vpaddw ymm3, ymm7, ymm3
vpsubw ymm2, ymm3, ymm2
vpackuswb ymm2, ymm0, ymm2
vpextrd eax, xmm2, 1
vzeroupper
leave
ret
This is already pretty close to clang's output. However, look at this part:
vmovdqa YMMWORD PTR [esp], ymm3 ; Spill YMM3
vmovdqu ymm2, YMMWORD PTR [eax+32]
vpaddw ymm5, ymm5, YMMWORD PTR [esp] ; Mem instead of YMM3?
vmovdqu ymm3, YMMWORD PTR [edx] ; Old YMM3 becomes dead here
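The spill is completely unnecessary in our case: YMM3 still holds the spilled
value when vpaddw reads it back from memory, and it is only overwritten by the
load that follows. The spill is also the only reason why the prolog/epilog
requires code to perform dynamic stack alignment. If this one thing were
eliminated, GCC would generate code basically comparable to clang's.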
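To make the point concrete, here is a hand-edited sketch of the same four-instruction window with the spill removed. It is not compiler output; it only assumes, as the listing shows, that YMM3 is not written between the vpackuswb that produces it and the vpaddw that consumes it:

```asm
; hand-edited sketch, not actual GCC output
; (the store "vmovdqa YMMWORD PTR [esp], ymm3" is dropped entirely)
vmovdqu ymm2, YMMWORD PTR [eax+32]
vpaddw ymm5, ymm5, ymm3              ; read YMM3 directly instead of [esp]
vmovdqu ymm3, YMMWORD PTR [edx]      ; old YMM3 becomes dead here, as before
```

With no 32-byte stack slot in use, nothing in the function body would touch [esp], so the "and esp, -32" and "lea esp, [esp-32]" in the prolog would have no remaining purpose.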
But thanks for the -fschedule-insns hint, I didn't know about it.