[Bug target/89445] New: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
thiago at kde dot org
gcc-bugzilla@gcc.gnu.org
Fri Feb 22 01:54:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
Bug ID: 89445
Summary: [8 regression] _mm512_maskz_loadu_pd "forgets" to use
the mask
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---
Created attachment 45793
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45793&action=edit
example showing segmentation fault
In the following code:
void daxpy(size_t n, double a, double const* __restrict x, double* __restrict
y)
{
const __m512d v_a = _mm512_broadcastsd_pd(_mm_set_sd(a));
const __mmask16 final = (1U << (n % 8u)) - 1;
__mmask16 mask = 65535u;
for (size_t i = 0; i < n * sizeof(double); i += 8 * sizeof(double)) {
if (i + 8 * sizeof(double) > n * sizeof(double))
mask = final;
__m512d v_x = _mm512_maskz_loadu_pd(mask, (char const *)x + i);
__m512d v_y = _mm512_maskz_loadu_pd(mask, (char const *)y + i);
__m512d tmp = _mm512_fmadd_pd(v_x, v_a, v_y);
_mm512_mask_storeu_pd((char *)y + i, mask, tmp);
}
}
When compiled with GCC 8, the loop looks like
.L5:
cmpq %rax, %r10
cmovb %r9d, %r8d
movzbl %r8b, %ecx
kmovd %ecx, %k1
leaq (%rdx,%rax), %rcx
vmovapd (%rsi,%rax), %zmm1{%k1}{z}
vmovapd (%rcx), %zmm2{%k1}{z}
vfmadd132pd %zmm0, %zmm2, %zmm1
vmovupd %zmm1, (%rcx){%k1}
addq $64, %rax
cmpq %rdi, %rax
jb .L5
Whereas GCC trunk (as of r269073) generates:
.L5:
vmovapd (%rsi,%rax), %zmm1
cmpq %rax, %r9
vfmadd213pd (%rdx,%rax), %zmm0, %zmm1
cmovb %r8d, %ecx
kmovb %ecx, %k1
vmovupd %zmm1, (%rdx,%rax){%k1}
addq $64, %rax
cmpq %rdi, %rax
jb .L5
Godbolt link: https://gcc.godbolt.org/z/2ys7ZO
Since the neither memory loads are masked, the resulting registers can contain
garbage and trigger FP exceptions. They can also cause segmentation faults if
portions of the source are not mapped regions. The attached example forces the
operation on a page boundary where half the 64 bytes addressed by the second
load are unmapped. When run, the example will crash.
More information about the Gcc-bugs
mailing list