Bug 84413

Summary: [8/9 Regression] -mtune=skylake,skylake-avx512,cannonlake,icelake disable many optimizations
Product: gcc
Reporter: H.J. Lu <hjl.tools>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED
Severity: normal
CC: craig.topper, dimhen, gandalf, ubizjak
Priority: P3
Version: 8.0.1
Target Milestone: 8.2
Host:
Target: x86
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-12 00:00:00

Description H.J. Lu 2018-02-15 20:40:16 UTC
[hjl@gnu-skx-1 gcc]$ cat x.c
unsigned long long bextr64_demanded(unsigned long long x)
{
    return x | 0x8000000000;
}
[hjl@gnu-skx-1 gcc]$ ./xgcc -B./ -S -O2 x.c -march=skylake-avx512
[hjl@gnu-skx-1 gcc]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
	.globl	bextr64_demanded
	.type	bextr64_demanded, @function
bextr64_demanded:
.LFB0:
	.cfi_startproc
	movabsq	$549755813888, %rax
	orq	%rdi, %rax
	xorl	%edi, %edi
	ret
	.cfi_endproc
.LFE0:
	.size	bextr64_demanded, .-bextr64_demanded
	.ident	"GCC: (GNU) 8.0.1 20180212 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 gcc]$ ./xgcc -B./ -S -O2 x.c 
[hjl@gnu-skx-1 gcc]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
	.globl	bextr64_demanded
	.type	bextr64_demanded, @function
bextr64_demanded:
.LFB0:
	.cfi_startproc
	movq	%rdi, %rax
	xorl	%edi, %edi
	btsq	$39, %rax
	ret
	.cfi_endproc
.LFE0:
	.size	bextr64_demanded, .-bextr64_demanded
	.ident	"GCC: (GNU) 8.0.1 20180212 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 gcc]$
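
For reference, the constant in the test case is a single bit: 0x8000000000 equals 549755813888 (the movabsq immediate in the first listing) and 1 << 39 (the btsq operand in the second). It is larger than any positive sign-extended 32-bit immediate, which is why the first listing has to load it with movabsq before the orq. A minimal C sketch that checks this follows; set_bit_39 and check_equivalence are hypothetical helpers added for illustration and are not part of the original test case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The constant from the report, in the three forms that appear in it. */
_Static_assert(0x8000000000ULL == 549755813888ULL, "movabsq immediate");
_Static_assert(0x8000000000ULL == (1ULL << 39), "btsq bit index");
/* Larger than the largest positive sign-extended 32-bit immediate. */
_Static_assert(0x8000000000ULL > (unsigned long long)INT32_MAX, "needs imm64");

/* Same function as in the report. */
unsigned long long bextr64_demanded(unsigned long long x)
{
    return x | 0x8000000000;
}

/* Hypothetical helper: the same operation written as "set bit 39". */
unsigned long long set_bit_39(unsigned long long x)
{
    return x | (1ULL << 39);
}

/* Hypothetical helper: both forms agree on a few sample inputs. */
void check_equivalence(void)
{
    static const unsigned long long samples[] = {
        0ULL, 0x1234ULL, 1ULL << 39, ~0ULL
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(bextr64_demanded(samples[i]) == set_bit_39(samples[i]));
}
```

Both functions compute the same value, so the difference between the two listings is purely a matter of instruction selection under the chosen tuning.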
Comment 1 Uroš Bizjak 2018-02-15 22:20:05 UTC
Besides X86_TUNE_USE_BT, there is probably a long list of flags that have to be enabled for m_SKYLAKE_AVX512 (and m_CANNONLAKE and m_ICELAKE).

Somebody will have to go through all tune flags for the above mentioned targets.
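
To illustrate what going through the tune flags involves: each tuning flag carries a mask of processors, and the flag takes effect only when the processor selected by -mtune is in that mask. The sketch below is a simplified model under that assumption, not GCC's actual code. The enums, the mask tables, and tune_enabled are hypothetical names, and the processor sets listed for the flag are illustrative only.

```c
#include <assert.h>

/* Hypothetical, reduced processor list; the real one is much longer. */
enum processor { PROC_GENERIC, PROC_HASWELL, PROC_SKYLAKE_AVX512, PROC_COUNT };

/* One bit per processor. */
#define M(p) (1u << (p))

/* Hypothetical, reduced flag list. */
enum tune { TUNE_USE_BT, TUNE_COUNT };

/* Illustrative "before" state: the new processor is missing from the mask. */
const unsigned masks_before[TUNE_COUNT] = {
    [TUNE_USE_BT] = M(PROC_GENERIC) | M(PROC_HASWELL),
};

/* Illustrative "after" state: the new processor has been added. */
const unsigned masks_after[TUNE_COUNT] = {
    [TUNE_USE_BT] = M(PROC_GENERIC) | M(PROC_HASWELL) | M(PROC_SKYLAKE_AVX512),
};

/* A flag is active when the selected processor's bit is set in its mask. */
int tune_enabled(const unsigned *masks, enum tune flag, enum processor p)
{
    return (masks[flag] & M(p)) != 0;
}
```

In this model, adding a processor without adding its bit to each relevant mask silently turns the corresponding optimizations off for that processor, which matches the symptom in the description.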
Comment 2 Julia Koval 2018-03-30 07:05:27 UTC
Author: jkoval
Date: Fri Mar 30 07:04:55 2018
New Revision: 258972

URL: https://gcc.gnu.org/viewcvs?rev=258972&root=gcc&view=rev
Log:
Enable tuning options for skylake-avx512.

gcc/
	PR target/84413
	* x86-tune.def (movx, partial_reg_dependency): Enable for
	m_SKYLAKE_AVX512.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune.def
Comment 3 Julia Koval 2018-04-16 06:00:37 UTC
Author: jkoval
Date: Mon Apr 16 05:59:52 2018
New Revision: 259395

URL: https://gcc.gnu.org/viewcvs?rev=259395&root=gcc&view=rev
Log:
Add sse_unaligned_load_optimal and sse_unaligned_store_optimal to Skylake.

gcc/
	PR target/84413
	* config/i386/x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL,
	X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Add m_SKYLAKE_AVX512.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune.def
Comment 4 Jakub Jelinek 2018-05-02 10:05:56 UTC
GCC 8.1 has been released.
Comment 5 hjl@gcc.gnu.org 2018-07-13 20:26:29 UTC
Author: hjl
Date: Fri Jul 13 20:25:57 2018
New Revision: 262649

URL: https://gcc.gnu.org/viewcvs?rev=262649&root=gcc&view=rev
Log:
x86: Tune Skylake, Cannonlake and Icelake as Haswell

r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
which are enabled by PROCESSOR_HASWELL.  As a result, -mtune=skylake
generates slower code on Skylake than before.  The same also applies
to Cannonlake and Icelake tuning.

This patch changes -mtune={skylake|cannonlake|icelake} to tune like
-mtune=haswell until their tuning is properly adjusted. It also
enables -mprefer-vector-width=256 for -mtune=haswell, which has no
impact on codegen when AVX512 isn't enabled.

Performance impacts on SPEC CPU 2017 rate with 1 copy using

-march=native -mfpmath=sse -O2 -m64

are

On Broadwell server:

500.perlbench_r		-0.56%
502.gcc_r		-0.18%
505.mcf_r		0.24%
520.omnetpp_r		0.00%
523.xalancbmk_r		-0.32%
525.x264_r		-0.17%
531.deepsjeng_r		0.00%
541.leela_r		0.00%
548.exchange2_r		0.12%
557.xz_r		0.00%
Geomean			0.00%

503.bwaves_r		0.00%
507.cactuBSSN_r		0.21%
508.namd_r		0.00%
510.parest_r		0.19%
511.povray_r		-0.48%
519.lbm_r		0.00%
521.wrf_r		0.28%
526.blender_r		0.19%
527.cam4_r		0.39%
538.imagick_r		0.00%
544.nab_r		-0.36%
549.fotonik3d_r		0.51%
554.roms_r		0.00%
Geomean			0.17%

On Skylake client:

500.perlbench_r		0.96%
502.gcc_r		0.13%
505.mcf_r		-1.03%
520.omnetpp_r		-1.11%
523.xalancbmk_r		1.02%
525.x264_r		0.50%
531.deepsjeng_r		2.97%
541.leela_r		0.50%
548.exchange2_r		-0.95%
557.xz_r		2.41%
Geomean			0.56%

503.bwaves_r		0.49%
507.cactuBSSN_r		3.17%
508.namd_r		4.05%
510.parest_r		0.15%
511.povray_r		0.80%
519.lbm_r		3.15%
521.wrf_r		10.56%
526.blender_r		2.97%
527.cam4_r		2.36%
538.imagick_r		46.40%
544.nab_r		2.04%
549.fotonik3d_r		0.00%
554.roms_r		1.27%
Geomean			5.49%

On Skylake server:

500.perlbench_r		0.71%
502.gcc_r		-0.51%
505.mcf_r		-1.06%
520.omnetpp_r		-0.33%
523.xalancbmk_r		-0.22%
525.x264_r		1.72%
531.deepsjeng_r		-0.26%
541.leela_r		0.57%
548.exchange2_r		-0.75%
557.xz_r		-1.28%
Geomean			-0.21%

503.bwaves_r		0.00%
507.cactuBSSN_r		2.66%
508.namd_r		3.67%
510.parest_r		1.25%
511.povray_r		2.26%
519.lbm_r		1.69%
521.wrf_r		11.03%
526.blender_r		3.39%
527.cam4_r		1.69%
538.imagick_r		64.59%
544.nab_r		-0.54%
549.fotonik3d_r		2.68%
554.roms_r		0.00%
Geomean			6.19%

This patch improves -march=native performance on Skylake by up to 60% and
leaves -march=native performance unchanged on Haswell.

gcc/

2018-07-13  H.J. Lu  <hongjiu.lu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/84413
	* config/i386/i386.c (m_CORE_AVX512): New.
	(m_CORE_AVX2): Likewise.
	(m_CORE_ALL): Add m_CORE_AVX2.
	* config/i386/x86-tune.def: Replace m_HASWELL with m_CORE_AVX2.
	Replace m_SKYLAKE_AVX512 with m_CORE_AVX512 on avx256_optimal
	and remove the rest of m_SKYLAKE_AVX512.

gcc/testsuite/

2018-07-13  H.J. Lu  <hongjiu.lu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/84413
	* gcc.target/i386/pr84413-1.c: New test.
	* gcc.target/i386/pr84413-2.c: Likewise.
	* gcc.target/i386/pr84413-3.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr84413-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr84413-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr84413-3.c
Modified:
    trunk/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/testsuite/ChangeLog
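
The grouping described in the ChangeLog above can be sketched as composite processor masks. This is a simplified model, not the committed code: the composition of the two groups is an assumption inferred from the log title ("Tune Skylake, Cannonlake and Icelake as Haswell"), and every name below (the P_* enum, the M_* macros, applies, check_grouping) is hypothetical.

```c
#include <assert.h>

/* Hypothetical, reduced processor list. */
enum proc {
    P_SANDYBRIDGE, P_HASWELL, P_SKYLAKE,
    P_SKYLAKE_AVX512, P_CANNONLAKE, P_ICELAKE
};

#define BIT(p) (1u << (p))

#define M_SANDYBRIDGE    BIT(P_SANDYBRIDGE)
#define M_HASWELL        BIT(P_HASWELL)
#define M_SKYLAKE        BIT(P_SKYLAKE)
#define M_SKYLAKE_AVX512 BIT(P_SKYLAKE_AVX512)
#define M_CANNONLAKE     BIT(P_CANNONLAKE)
#define M_ICELAKE        BIT(P_ICELAKE)

/* Assumed composition: AVX512-capable cores form one group ... */
#define M_CORE_AVX512 (M_SKYLAKE_AVX512 | M_CANNONLAKE | M_ICELAKE)
/* ... and the AVX2 group is Haswell plus everything newer. */
#define M_CORE_AVX2   (M_HASWELL | M_SKYLAKE | M_CORE_AVX512)

/* A flag keyed on `mask` applies to processor `p` if p's bit is set. */
int applies(unsigned mask, enum proc p)
{
    return (mask & BIT(p)) != 0;
}

void check_grouping(void)
{
    unsigned before = M_HASWELL;    /* flag keyed on Haswell only */
    unsigned after  = M_CORE_AVX2;  /* same flag after the replacement */

    assert(applies(before, P_HASWELL));
    assert(!applies(before, P_SKYLAKE));      /* optimization lost */
    assert(applies(after, P_HASWELL));        /* Haswell unchanged */
    assert(applies(after, P_SKYLAKE));        /* now tuned like Haswell */
    assert(applies(after, P_SKYLAKE_AVX512));
    assert(applies(after, P_CANNONLAKE));
    assert(applies(after, P_ICELAKE));
}
```

Under this model, replacing a Haswell-only mask with the AVX2 group in every flag definition makes the newer processors inherit Haswell's tuning in one step, while a flag such as avx256_optimal can stay keyed on the narrower AVX512 group.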
Comment 6 hjl@gcc.gnu.org 2018-07-13 20:36:32 UTC
Author: hjl
Date: Fri Jul 13 20:36:01 2018
New Revision: 262650

URL: https://gcc.gnu.org/viewcvs?rev=262650&root=gcc&view=rev
Log:
x86: Tune Skylake, Cannonlake and Icelake as Haswell

r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
which are enabled by PROCESSOR_HASWELL.  As a result, -mtune=skylake
generates slower code on Skylake than before.  The same also applies
to Cannonlake and Icelake tuning.

This patch changes -mtune={skylake|cannonlake|icelake} to tune like
-mtune=haswell until their tuning is properly adjusted. It also
enables -mprefer-vector-width=256 for -mtune=haswell, which has no
impact on codegen when AVX512 isn't enabled.

Performance impacts on SPEC CPU 2017 rate with 1 copy using

-march=native -mfpmath=sse -O2 -m64

are

On Broadwell server:

500.perlbench_r		-0.56%
502.gcc_r		-0.18%
505.mcf_r		0.24%
520.omnetpp_r		0.00%
523.xalancbmk_r		-0.32%
525.x264_r		-0.17%
531.deepsjeng_r		0.00%
541.leela_r		0.00%
548.exchange2_r		0.12%
557.xz_r		0.00%
Geomean			0.00%

503.bwaves_r		0.00%
507.cactuBSSN_r		0.21%
508.namd_r		0.00%
510.parest_r		0.19%
511.povray_r		-0.48%
519.lbm_r		0.00%
521.wrf_r		0.28%
526.blender_r		0.19%
527.cam4_r		0.39%
538.imagick_r		0.00%
544.nab_r		-0.36%
549.fotonik3d_r		0.51%
554.roms_r		0.00%
Geomean			0.17%

On Skylake client:

500.perlbench_r		0.96%
502.gcc_r		0.13%
505.mcf_r		-1.03%
520.omnetpp_r		-1.11%
523.xalancbmk_r		1.02%
525.x264_r		0.50%
531.deepsjeng_r		2.97%
541.leela_r		0.50%
548.exchange2_r		-0.95%
557.xz_r		2.41%
Geomean			0.56%

503.bwaves_r		0.49%
507.cactuBSSN_r		3.17%
508.namd_r		4.05%
510.parest_r		0.15%
511.povray_r		0.80%
519.lbm_r		3.15%
521.wrf_r		10.56%
526.blender_r		2.97%
527.cam4_r		2.36%
538.imagick_r		46.40%
544.nab_r		2.04%
549.fotonik3d_r		0.00%
554.roms_r		1.27%
Geomean			5.49%

On Skylake server:

500.perlbench_r		0.71%
502.gcc_r		-0.51%
505.mcf_r		-1.06%
520.omnetpp_r		-0.33%
523.xalancbmk_r		-0.22%
525.x264_r		1.72%
531.deepsjeng_r		-0.26%
541.leela_r		0.57%
548.exchange2_r		-0.75%
557.xz_r		-1.28%
Geomean			-0.21%

503.bwaves_r		0.00%
507.cactuBSSN_r		2.66%
508.namd_r		3.67%
510.parest_r		1.25%
511.povray_r		2.26%
519.lbm_r		1.69%
521.wrf_r		11.03%
526.blender_r		3.39%
527.cam4_r		1.69%
538.imagick_r		64.59%
544.nab_r		-0.54%
549.fotonik3d_r		2.68%
554.roms_r		0.00%
Geomean			6.19%

This patch improves -march=native performance on Skylake by up to 60% and
leaves -march=native performance unchanged on Haswell.

gcc/

	Backport from mainline
	2018-07-13  H.J. Lu  <hongjiu.lu@intel.com>
		    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/84413
	* config/i386/i386.c (m_CORE_AVX512): New.
	(m_CORE_AVX2): Likewise.
	(m_CORE_ALL): Add m_CORE_AVX2.
	* config/i386/x86-tune.def: Replace m_HASWELL with m_CORE_AVX2.
	Replace m_SKYLAKE_AVX512 with m_CORE_AVX512 on avx256_optimal
	and remove the rest of m_SKYLAKE_AVX512.

gcc/testsuite/

	Backport from mainline
	2018-07-13  H.J. Lu  <hongjiu.lu@intel.com>
		    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/84413
	* gcc.target/i386/pr84413-1.c: New test.
	* gcc.target/i386/pr84413-2.c: Likewise.
	* gcc.target/i386/pr84413-3.c: Likewise.

Added:
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/pr84413-1.c
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/pr84413-2.c
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/pr84413-3.c
Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/config/i386/i386.c
    branches/gcc-8-branch/gcc/config/i386/x86-tune.def
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
Comment 7 H.J. Lu 2018-07-13 20:37:34 UTC
Fixed for GCC 8.2 and GCC 9.