[Bug target/89445] New: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask

thiago at kde dot org <gcc-bugzilla@gcc.gnu.org>
Fri Feb 22 01:54:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

            Bug ID: 89445
           Summary: [8 regression] _mm512_maskz_loadu_pd "forgets" to use
                    the mask
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

Created attachment 45793
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45793&action=edit
example showing segmentation fault

In the following code:

#include <immintrin.h>  /* AVX-512 intrinsics */
#include <stddef.h>     /* size_t */

void daxpy(size_t n, double a, double const* __restrict x, double* __restrict y)
{
    const __m512d v_a = _mm512_broadcastsd_pd(_mm_set_sd(a));

    const __mmask16 final = (1U << (n % 8u)) - 1;
    __mmask16 mask = 65535u;
    for (size_t i = 0; i < n * sizeof(double); i += 8 * sizeof(double)) {
        if (i + 8 * sizeof(double) > n * sizeof(double))
            mask = final;
        __m512d v_x = _mm512_maskz_loadu_pd(mask, (char const *)x + i);
        __m512d v_y = _mm512_maskz_loadu_pd(mask, (char const *)y + i);
        __m512d tmp = _mm512_fmadd_pd(v_x, v_a, v_y);
        _mm512_mask_storeu_pd((char *)y + i, mask, tmp);
    }
}
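For reference, the loop above computes a standard daxpy (y := a*x + y) over n doubles, using the tail mask for the final partial vector. A plain scalar equivalent is sketched below; `daxpy_ref` is a hypothetical name introduced here for illustration and is not part of the report or the attachment:

```c
#include <stddef.h>

/* Scalar reference for what the intrinsic loop computes:
   y[i] = a * x[i] + y[i] for i in [0, n). */
static void daxpy_ref(size_t n, double a,
                      double const *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The correctness question in this bug is purely about the tail: for the last iteration only the low `n % 8` lanes may be touched, which is what the masked loads are supposed to guarantee.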

When compiled with GCC 8, the loop looks like

.L5:
        cmpq    %rax, %r10
        cmovb   %r9d, %r8d
        movzbl  %r8b, %ecx
        kmovd   %ecx, %k1
        leaq    (%rdx,%rax), %rcx
        vmovapd (%rsi,%rax), %zmm1{%k1}{z}
        vmovapd (%rcx), %zmm2{%k1}{z}
        vfmadd132pd     %zmm0, %zmm2, %zmm1
        vmovupd %zmm1, (%rcx){%k1}
        addq    $64, %rax
        cmpq    %rdi, %rax
        jb      .L5

Whereas GCC trunk (as of r269073) generates:

.L5:
        vmovapd (%rsi,%rax), %zmm1
        cmpq    %rax, %r9
        vfmadd213pd     (%rdx,%rax), %zmm0, %zmm1
        cmovb   %r8d, %ecx
        kmovb   %ecx, %k1
        vmovupd %zmm1, (%rdx,%rax){%k1}
        addq    $64, %rax
        cmpq    %rdi, %rax
        jb      .L5

Godbolt link: https://gcc.godbolt.org/z/2ys7ZO

Since neither memory load is masked, the resulting registers can contain
garbage and trigger FP exceptions. The loads can also cause segmentation faults
if portions of the source lie in unmapped regions. The attached example forces
the operation onto a page boundary where half of the 64 bytes addressed by the
second load are unmapped; when run, the example crashes.

