Bug 91994 - [10 Regression] r276327 breaks -mvzeroupper
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Sandiford
Reported: 2019-10-04 18:02 UTC by H.J. Lu
Modified: 2019-10-18 08:35 UTC
CC List: 3 users

Last reconfirmed: 2019-10-04 00:00:00


Description H.J. Lu 2019-10-04 18:02:26 UTC
On x86-64, r276327 miscompiled 557.xz_r in SPEC CPU 2017 with

-fno-unsafe-math-optimizations -mfpmath=sse  -march=skylake   -Ofast -funroll-loops

[hjl@gnu-skx-1 run_peak_refrate_gcc_native-m64.0000]$ cat cpu2006docs.tar-250-6e.out.mis
0011:  Uncompressed data 262144000 bytes in length
       Decompression decoder error: Compressed data is corrupt (code 9)
       ^
0012:  Uncompressed data compared correctly
       Uncompressed data 262144000 bytes in length
                         ^
0013:  Tested 250 MiB buffer: OK!
       Uncompressed data compared correctly
       ^
'cpu2006docs.tar-250-6e.out' long
[hjl@gnu-skx-1 run_peak_refrate_gcc_native-m64.0000]$
Comment 1 H.J. Lu 2019-10-04 18:40:25 UTC
liblzma/lzma/lzma_encoder.c and/or liblzma/check/sha256.c are miscompiled.
The good one has

callq  0 <length_update_prices>
cmp    %r13,%r14
vmovdqa XXX(%rip),%ymm0    <<< %ymm0 reloaded after the call
jne
...
vmovdqu %ymm0,XXX(%rbx)

The bad one has

callq  0 <length_update_prices>
cmp    %r13,%r14
jne
vmovdqu %ymm0,XXX(%rbx)    <<< uses %ymm0, but the reload after the call is missing

length_update_prices clobbers %ymm0, and the register isn't reloaded after the call.
Comment 2 H.J. Lu 2019-10-04 20:19:15 UTC
liblzma/check/sha256.c is miscompiled.
Comment 3 H.J. Lu 2019-10-04 20:55:49 UTC
lzma_sha256_update in sha256.c is miscompiled with -O3 -march=skylake.
Correct code:

L42:
...
        vpshufb %ymm7, %ymm1, %ymm0
        vmovdqa %ymm0, (%rsp)
        leaq    64(%r13), %rdi
        vpshufb %ymm7, %ymm2, %ymm0
        movq    %rsp, %rsi
        vmovdqa %ymm0, 32(%rsp)
        call    transform
        vmovdqa .LC0(%rip), %ymm7  <<< This is missing.
.L49:
        testq   %rbx, %rbx
        je      .L69
.L50:
...
        cmpl    $32, %ecx
        jb      .L65
Comment 4 H.J. Lu 2019-10-04 21:47:56 UTC
(In reply to H.J. Lu from comment #3)
> lzma_sha256_update in sha256.c is miscompiled with -O3 -march=skylake.
> Correct code:
> 
> L42:
> ...
>         vpshufb %ymm7, %ymm1, %ymm0
>         vmovdqa %ymm0, (%rsp)
>         leaq    64(%r13), %rdi
>         vpshufb %ymm7, %ymm2, %ymm0
>         movq    %rsp, %rsi
>         vmovdqa %ymm0, 32(%rsp)
>         call    transform
>         vmovdqa .LC0(%rip), %ymm7  <<< This is missing.
> .L49:
>         testq   %rbx, %rbx
>         je      .L69
> .L50:
> ...
>         cmpl    $32, %ecx
>         jb      .L65

vzeroupper clears the upper bits of %ymm7 when transform returns.  With
-mvzeroupper, the upper bits of the vector registers are clobbered on
return from any callee that itself uses YMM/ZMM registers.
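
For reference, the architectural effect of vzeroupper (as documented in the
Intel SDM) can be sketched in pseudo-code as:

  vzeroupper:
      for i in 0 .. 15:
          ymm[i][255:128] := 0    # the low 128 bits (the xmm part) are preserved
      # %ymm16-%ymm31 (EVEX-only registers) are not affected

So any 256-bit value that is live in %ymm0-%ymm15 across a call to a function
that executes vzeroupper loses its upper half, which is exactly what happens
to %ymm7 here.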
Comment 5 H.J. Lu 2019-10-04 22:42:29 UTC
[hjl@gnu-skx-1 gcc]$ cat bad.c
#include <stdlib.h>
#include <immintrin.h>

__m256i x1, x2, x3;

__attribute__ ((noinline))
static void
foo (void)
{
  x1 = x2;
}

void
bar (void)
{
  __m256i x = x1;	/* live across the call to foo; kept in %ymm1 below */
  foo ();
  x3 = x;
}

__attribute__ ((noinline))
int
main (void)
{
  __m256i x = _mm256_set1_epi8 (3);
  x1 = x;
  bar ();
  if (__builtin_memcmp (&x3, &x, sizeof (x)))
    abort ();
  return 0;
}
[hjl@gnu-skx-1 gcc]$ ./xgcc -B./ -march=skylake  -O2  bad.c 
[hjl@gnu-skx-1 gcc]$ ./a.out
Aborted
[hjl@gnu-skx-1 gcc]$ ./xgcc -B./ -march=skylake  -O2  bad.c -S
[hjl@gnu-skx-1 gcc]$ cat bad.s
	.file	"bad.c"
	.text
	.p2align 4
	.type	foo, @function
foo:
.LFB5339:
	.cfi_startproc
	vmovdqa	x2(%rip), %ymm0
	vmovdqa	%ymm0, x1(%rip)
	vzeroupper <<< Clobbers the upper bits of %ymm1.
	ret
	.cfi_endproc
.LFE5339:
	.size	foo, .-foo
	.p2align 4
	.globl	bar
	.type	bar, @function
bar:
.LFB5340:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	vmovdqa	x1(%rip), %ymm1
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	andq	$-32, %rsp
	call	foo
	vmovdqa	%ymm1, x3(%rip)
	vzeroupper
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE5340:
	.size	bar, .-bar
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB5341:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movabsq	$217020518514230019, %rax
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	andq	$-32, %rsp
	subq	$32, %rsp
	vmovdqa	.LC0(%rip), %ymm1
	vmovdqa	%ymm1, (%rsp)
	vmovdqa	%ymm1, x1(%rip)
	call	foo
	vmovdqa	%ymm1, x3(%rip)
	movq	x3+8(%rip), %rdx
	xorq	(%rsp), %rax
	xorq	8(%rsp), %rdx
	orq	%rax, %rdx
	jne	.L6
	movq	x3+24(%rip), %rdx
	movq	x3+16(%rip), %rax
	xorq	24(%rsp), %rdx
	xorq	16(%rsp), %rax
	orq	%rax, %rdx
	je	.L9
.L6:
	vzeroupper
	call	abort
	.p2align 4,,10
	.p2align 3
.L9:
	xorl	%eax, %eax
	vzeroupper
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE5341:
	.size	main, .-main
	.comm	x3,32,32
	.comm	x2,32,32
	.comm	x1,32,32
	.section	.rodata.cst32,"aM",@progbits,32
	.align 32
.LC0:
	.quad	217020518514230019
	.quad	217020518514230019
	.quad	217020518514230019
	.quad	217020518514230019
	.ident	"GCC: (GNU) 10.0.0 20191003 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 gcc]$
Comment 6 Richard Sandiford 2019-10-05 13:34:52 UTC
Mine.  See https://gcc.gnu.org/ml/gcc-patches/2019-10/msg00416.html for some thoughts about how to fix this.
Comment 7 Richard Sandiford 2019-10-07 08:36:37 UTC
Author: rsandifo
Date: Mon Oct  7 08:36:06 2019
New Revision: 276648

URL: https://gcc.gnu.org/viewcvs?rev=276648&root=gcc&view=rev
Log:
[i386] Make the vzeroupper pattern describe its effects (PR91994)

The problem in this PR was that vzeroupper has an effect on register
contents, but those effects weren't modelled in the rtl pattern,
which was just an unspec_volatile.

This patch fixes that by running a subpass after vzeroupper insertion
to add SETs and CLOBBERs as appropriate.  See the comments in the patch
for more details.
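
Illustratively (a sketch of the intent, not the exact rtl GCC prints; the
unspec name, register choices and modes here are only for exposition), the
vzeroupper insn goes from a bare

  (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)

to a parallel of the form

  (parallel
    [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)
     (set (reg:V2DI xmm0) (reg:V2DI xmm0))   ;; live: low 128 bits preserved
     (clobber (reg:V2DI xmm1))               ;; dead: fully clobbered
     ...])

so that the dataflow passes can see which registers the instruction changes.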

2019-10-07  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	PR target/91994
	* config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
	and wrap the unspec_volatile in a parallel.
	(*avx_vzeroupper): New define_insn.  Use a match_parallel around
	the unspec_volatile.
	* config/i386/predicates.md (vzeroupper_pattern): Expect the
	unspec_volatile to be wrapped in a parallel.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
	(ix86_add_reg_usage_to_vzerouppers): New functions.
	(rest_of_handle_insert_vzeroupper): Use them to add register
	usage information to the vzeroupper instructions.

gcc/testsuite/
	PR target/91994
	* gcc.target/i386/pr91994.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr91994.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
    trunk/gcc/config/i386/predicates.md
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Comment 8 Richard Sandiford 2019-10-07 08:40:52 UTC
Fixed for the reduced testcase.  Please reopen if there's still a problem with the SPEC test itself.
Comment 9 Uroš Bizjak 2019-10-07 13:36:24 UTC
(In reply to rsandifo@gcc.gnu.org from comment #8)
> Fixed for the reduced testcase.  Please reopen if there's still a problem
> with the SPEC test itself.

Please note that when the testcase from comment #5 is compiled with "-march=skylake -O2 -mavx512f", the vzeroupper before the call to "foo" is now missing:

bar:
        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-32, %rsp
        subq    $32, %rsp
        vmovdqa x1(%rip), %ymm0
        vmovdqa %ymm0, (%rsp)
        call    foo
        vmovdqa (%rsp), %ymm0
        vmovdqa %ymm0, x3(%rip)
        vzeroupper
        leave
        ret

gcc-9.2.1 compiles the function to:

bar:
        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-32, %rsp
        subq    $32, %rsp
        vmovdqa x1(%rip), %ymm1
        vmovdqa %ymm1, (%rsp)
        vzeroupper                      <---- here
        call    foo
        vmovdqa (%rsp), %ymm1
        vmovdqa %ymm1, x3(%rip)
        vzeroupper
        leave
        ret

(I would also expect %ymm16+ to be used as a temporary, since those registers are not clobbered by the vzeroupper in "foo".)
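
For illustration only (this sketch additionally assumes AVX512VL, since
%ymm16-%ymm31 are reachable only through EVEX encodings such as vmovdqa64),
such a "bar" would not need the stack round-trip at all:

bar:
        vmovdqa64 x1(%rip), %ymm16     # not clobbered by vzeroupper in foo
        call    foo
        vmovdqa64 %ymm16, x3(%rip)
        vzeroupper
        ret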
Comment 10 Uroš Bizjak 2019-10-07 18:47:16 UTC
Richard, since vzeroupper clobbers only xmm0-xmm15 (xmm0-xmm7 on 32-bit targets), shouldn't we use SSE_REGS instead of ALL_SSE_REGS here:

Index: i386.c
===================================================================
--- i386.c      (revision 276660)
+++ i386.c      (working copy)
@@ -13530,7 +13530,7 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
         modes wider than 256 bits.  It's only safe to issue a
         vzeroupper if all SSE registers are clobbered.  */
       const function_abi &abi = insn_callee_abi (insn);
-      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
+      if (!hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
                                  abi.mode_clobbers (V4DImode)))
        return AVX_U128_ANY;
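
For context, a rough sketch of the two register classes involved (64-bit
target):

  SSE_REGS     = { %xmm0, ..., %xmm15 }                # all clobbered by vzeroupper
  ALL_SSE_REGS = SSE_REGS + { %xmm16, ..., %xmm31 }    # %xmm16+ untouched by it

so testing ALL_SSE_REGS demands clobbers of registers that vzeroupper never
touches, which is stricter than the safety condition requires.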
Comment 11 Richard Sandiford 2019-10-07 19:15:33 UTC
"ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org> writes:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
>
> --- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
> Richard, since vzeroupper clobbers only xmm0-xmm15 (xmm0-xmm7 on 32-bit targets),
> shouldn't we use SSE_REGS instead of ALL_SSE_REGS here:
>
> Index: i386.c
> ===================================================================
> --- i386.c      (revision 276660)
> +++ i386.c      (working copy)
> @@ -13530,7 +13530,7 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
>          modes wider than 256 bits.  It's only safe to issue a
>          vzeroupper if all SSE registers are clobbered.  */
>        const function_abi &abi = insn_callee_abi (insn);
> -      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
> +      if (!hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
>                                   abi.mode_clobbers (V4DImode)))
>         return AVX_U128_ANY;

Ah, yeah.  LGTM, thanks.
Comment 12 uros 2019-10-08 17:02:28 UTC
Author: uros
Date: Tue Oct  8 17:01:55 2019
New Revision: 276707

URL: https://gcc.gnu.org/viewcvs?rev=276707&root=gcc&view=rev
Log:
	PR target/91994
	* config/i386/i386.c (ix86_avx_u128_mode_needed): Use SSE_REGS
	instead of ALL_SSE_REGS to check if a function call preserves some
	256-bit SSE registers.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Comment 13 Richard Sandiford 2019-10-14 21:33:32 UTC
Is this still a problem?  I think Uroš's patch fixed the issue
mentioned in comment #9.
Comment 14 Richard Sandiford 2019-10-18 08:35:41 UTC
Assuming fixed.