85328 – [8 Regression] accessing ymm16 with non-avx512 instruction form

Summary: [8 Regression] accessing ymm16 with non-avx512 instruction form

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	8.0.1

Importance:	P1 normal
Target Milestone:	8.0
Assignee:	Jakub Jelinek

URL:
Keywords:	assemble-failure

Depends on:
Blocks:

Reported:	2018-04-10 18:56 UTC by Zdenek Sojka
Modified:	2018-04-12 12:30 UTC
CC List:	1 user

See Also:
Host:	x86_64-pc-linux-gnu
Target:	x86_64-pc-linux-gnu
Build:
Known to work:
Known to fail:	8.0.1
Last reconfirmed:	2018-04-10 00:00:00

Attachments
reduced testcase (148 bytes, text/plain) 2018-04-10 18:56 UTC, Zdenek Sojka
gcc8-pr85328.patch (1.13 KB, patch) 2018-04-11 10:16 UTC, Jakub Jelinek

Description Zdenek Sojka 2018-04-10 18:56:53 UTC

Created attachment 43902
reduced testcase

Compiler output:
$ x86_64-pc-linux-gnu-gcc -O3 -fno-caller-saves -mavx512f testcase.c
/tmp/ccNXhuu3.s: Assembler messages:
/tmp/ccNXhuu3.s:418: Error: unsupported instruction `vpand'

The failing instruction is:
...
	vpand	%ymm16, %ymm1, %ymm1
...

I tried both the most recent GNU as and NASM (both built from current git), but neither accepts this form. Registers ymm16 and above can be accessed only with the EVEX-prefixed VPANDD or VPANDQ.

I am really not an expert in AVX512 instruction encoding, but gcc seems to be wrong here according to the assemblers I tested and Intel's SDM vol.2:
only vpandd/vpandq use the EVEX prefix, and ymm >= 16 can be used only with the EVEX prefix.

(this can be generalized to other instructions, such as vpor, and maybe to other registers, such as xmm/zmm)
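
The encoding constraint described above can be sketched as a small C model. This is only an illustration, not assembler or GCC code: the names `enum encoding` and `regs_encodable` are hypothetical. It assumes 64-bit mode, where a VEX prefix provides four bits per register operand (registers 0-15) and an EVEX prefix provides five (registers 0-31).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two encodings discussed in this report. */
enum encoding { ENC_VEX, ENC_EVEX };

/* Return true if every register number in regnos[] can be encoded.
   VEX: 4 bits per register operand, so only registers 0-15.
   EVEX: 5 bits per register operand, so registers 0-31. */
static bool regs_encodable(enum encoding enc, const int *regnos, size_t n)
{
    int limit = (enc == ENC_EVEX) ? 32 : 16;

    for (size_t i = 0; i < n; i++)
        if (regnos[i] < 0 || regnos[i] >= limit)
            return false;
    return true;
}
```

With the operands of the failing instruction (register numbers 16, 1, 1), `regs_encodable(ENC_VEX, ...)` is false and `regs_encodable(ENC_EVEX, ...)` is true, which matches the assemblers rejecting `vpand` (VEX only) while accepting `vpandd`/`vpandq`.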

$ x86_64-pc-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/repo/gcc-trunk/binary-latest-amd64/bin/x86_64-pc-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/repo/gcc-trunk/binary-trunk-259207-checking-yes-rtl-df-extra-nobootstrap-pr85177-amd64/bin/../libexec/gcc/x86_64-pc-linux-gnu/8.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /repo/gcc-trunk//configure --enable-languages=c,c++ --enable-valgrind-annotations --disable-nls --enable-checking=yes,rtl,df,extra --disable-bootstrap --with-cloog --with-ppl --with-isl --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --with-ld=/usr/bin/x86_64-pc-linux-gnu-ld --with-as=/usr/bin/x86_64-pc-linux-gnu-as --disable-libstdcxx-pch --prefix=/repo/gcc-trunk//binary-trunk-259207-checking-yes-rtl-df-extra-nobootstrap-pr85177-amd64
Thread model: posix
gcc version 8.0.1 20180407 (experimental) (GCC)

Comment 1 Jakub Jelinek 2018-04-10 19:45:42 UTC

Started with r250759.  Debugging.

Comment 2 Jakub Jelinek 2018-04-11 10:16:57 UTC

Created attachment 43907
gcc8-pr85328.patch

Many patterns rely on ix86_hard_regno_mode_ok not allowing < 512-bit vector modes in xmm16+ registers (when AVX512VL is not enabled).  Unfortunately, the vec_extract_lo_* splitters provide a loophole for this: on this testcase, for example, they create a V32QImode xmm16 hard register, which is then propagated into the vpand.  The patch closes that loophole, essentially forcing the low half or quarter vector extraction from the zmm16+ registers to be a 512-bit move into the other register (which must necessarily be < xmm16).
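
As a rough illustration of the invariant those patterns depend on, here is a simplified C model. It is not the actual ix86_hard_regno_mode_ok: the function name and parameters are hypothetical, it assumes AVX512F is enabled, and it covers only vector modes identified by their width in bits (scalar modes are ignored).

```c
#include <stdbool.h>

/* Hypothetical, simplified model: may a vector mode of mode_bits bits
   live in SSE register number regno?  Assumes AVX512F is enabled. */
static bool vector_mode_ok_in_reg(int regno, int mode_bits, bool have_avx512vl)
{
    bool narrow = (mode_bits == 128 || mode_bits == 256);

    if (regno < 0 || regno >= 32)
        return false;
    if (regno < 16)
        /* xmm0-xmm15: all three vector widths are fine. */
        return narrow || mode_bits == 512;
    /* xmm16-xmm31: 512-bit modes always; narrower ones need AVX512VL. */
    return mode_bits == 512 || (narrow && have_avx512vl);
}
```

In this model, the V32QImode (256-bit) value in register 16 from the testcase is rejected when `have_avx512vl` is false, while a 512-bit value in the same register is accepted; the splitters bypassed exactly that check by creating the lowpart subreg directly.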

Comment 3 Jakub Jelinek 2018-04-12 11:17:55 UTC

Author: jakub
Date: Thu Apr 12 11:17:23 2018
New Revision: 259344

URL: https://gcc.gnu.org/viewcvs?rev=259344&root=gcc&view=rev
Log:
	PR target/85328
	* config/i386/sse.md
	(<mask_codefor>avx512dq_vextract<shuffletype>64x2_1<mask_name> split,
	<mask_codefor>avx512f_vextract<shuffletype>32x4_1<mask_name> split,
	vec_extract_lo_<mode><mask_name> split, vec_extract_lo_v32hi,
	vec_extract_lo_v64qi): For non-AVX512VL if input is xmm16+ reg
	and output is a reg, avoid creating invalid lowpart subreg, but
	instead split into a 512-bit move.  Don't split if not AVX512VL,
	input is xmm16+ reg and output is a mem.
	(vec_extract_lo_<mode><mask_name>, vec_extract_lo_v32hi,
	vec_extract_lo_v64qi): Don't require split if not AVX512VL, input is
	xmm16+ reg and output is a mem.

	* gcc.target/i386/pr85328.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85328.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog

Comment 4 Jakub Jelinek 2018-04-12 12:30:23 UTC

Fixed.