[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
thiago at kde dot org
gcc-bugzilla@gcc.gnu.org
Fri Feb 22 21:59:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
Comment on attachment 45800
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45800
gcc9-pr89445.patch
Tested and works on my machine.
The movzbl that GCC 8 generated is also gone, but the patched compiler now
inserts moves *from* the OpMask register:
.L4:
        movq    %rcx, %rax
        addq    $64, %rcx
        cmpq    %rdi, %rcx
        kmovw   %k1, %r9d
        cmova   %r8d, %r9d
        kmovw   %r9d, %k1
        vmovupd (%rsi,%rax), %zmm1{%k1}{z}
        addq    %rdx, %rax
        vmovupd (%rax), %zmm2{%k1}{z}
        vfmadd132pd     %zmm0, %zmm2, %zmm1
        vmovupd %zmm1, (%rax){%k1}
        cmpq    %rdi, %rcx
        jb      .L4
Seems like it forgot the GPR that used to contain the mask, so it needed to
reload from %k1. The end detection is also slightly worse: the loop now runs
the same cmpq %rdi, %rcx twice per iteration.
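
For readers following the listing, here is a scalar C model of what the loop
appears to do. It is only a sketch read off the assembly, not the original
test case: it assumes %zmm0 holds one broadcast scalar, that the mask is all
ones except on the final partial vector (the value kept in %r8d), and that the
operation is dst[i] += src[i] * scale. The function names are made up for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* doubles per 512-bit vector */

/* Scalar model of one masked vector step: two zero-masking loads, a
   multiply-add, and a masked store. Lanes whose mask bit is clear are
   neither read nor written. */
static void masked_fma_block(double *dst, const double *src,
                             double scale, uint8_t mask)
{
    double a[LANES], b[LANES];

    for (int i = 0; i < LANES; i++) {
        int on = (mask >> i) & 1;
        a[i] = on ? src[i] : 0.0;  /* models vmovupd (src), %zmm1{%k1}{z} */
        b[i] = on ? dst[i] : 0.0;  /* models vmovupd (dst), %zmm2{%k1}{z} */
    }
    for (int i = 0; i < LANES; i++) {
        if ((mask >> i) & 1)       /* models vmovupd %zmm1, (dst){%k1} */
            dst[i] = a[i] * scale + b[i];
    }
}

/* Hypothetical driver: full mask for every vector except a partial last
   one, which is the selection the cmova in the listing seems to make. */
static void scaled_add(double *dst, const double *src, double scale, size_t n)
{
    for (size_t i = 0; i < n; i += LANES) {
        size_t left = n - i;
        uint8_t mask = left >= LANES ? 0xFF : (uint8_t)((1u << left) - 1);
        masked_fma_block(dst + i, src + i, scale, mask);
    }
}
```

In this model the mask is recomputed from a plain integer each iteration, which
corresponds to keeping it in a GPR; the listing instead round-trips through
%k1 with the two kmovw instructions.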
Yesterday, when I benchmarked with GCC 8, it ran 1000 iterations over 10
million doubles in roughly 11.9 ms, with 10 million instructions. Today, I am
getting 11.8 ms at 16 million instructions (the increase in instructions/cycle
is roughly equal to the increase in instructions per iteration, proving that
memory bandwidth is the bottleneck).
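
As a rough sanity check on those counts: 10 million doubles at 8 per 512-bit
vector is 1.25 million loop iterations per pass, and the loop quoted above has
13 instructions, which gives about 16 million. The figure of 8 instructions
per iteration for GCC 8 is only inferred from the 10 million total, and the
helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Loop iterations needed to cover n doubles when each iteration handles
   `lanes` doubles (8 for a 512-bit vector). A partial final vector still
   costs one full iteration. */
static uint64_t loop_iterations(uint64_t n, uint64_t lanes)
{
    assert(lanes > 0);
    return (n + lanes - 1) / lanes;
}

/* Instructions executed by the loop body alone for one pass over n doubles. */
static uint64_t loop_instructions(uint64_t n, uint64_t lanes,
                                  uint64_t insns_per_iter)
{
    return loop_iterations(n, lanes) * insns_per_iter;
}
```

With n = 10000000 and lanes = 8, passing 13 instructions per iteration yields
16250000 and passing 8 yields 10000000, in line with the measured 16 million
and 10 million.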