This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: PING^1: [PATCH GCC 8] x86: Re-enable partial_reg_dependency and movx for Haswell
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Sebastian Peryt <sebastian dot peryt at intel dot com>, Uros Bizjak <ubizjak at gmail dot com>
- Date: Thu, 31 May 2018 08:36:18 -0700
- Subject: Re: PING^1: [PATCH GCC 8] x86: Re-enable partial_reg_dependency and movx for Haswell
- References: <CAMe9rOp1JfFZny67jefC0AA+fUDvYxg04A93dT56Ly2ZzeAx0Q@mail.gmail.com> <CAMe9rOpghzvV=2cH9KbvP7enAvVyxANOB3XSAHKbebw8Kmu9fQ@mail.gmail.com> <20180531150809.GG55777@kam.mff.cuni.cz>
On Thu, May 31, 2018 at 8:08 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> This is the patch I am going to check into GCC 8.
>>
>> --
>> H.J.
>
>> From 9ecbfa1fd04dc4370a9ec4f3d56189cc07aee668 Mon Sep 17 00:00:00 2001
>> From: "H.J. Lu" <hjl.tools@gmail.com>
>> Date: Thu, 17 May 2018 09:52:09 -0700
>> Subject: [PATCH] x86: Re-enable partial_reg_dependency and movx for Haswell
>>
>> r254152 disabled partial_reg_dependency and movx for Haswell and newer
>> Intel processors. r258972 restored them for skylake-avx512. For Haswell,
>> movx improves performance. But partial_reg_stall may be better than
>> partial_reg_dependency in theory. We will investigate performance impact
>> of partial_reg_stall vs partial_reg_dependency on Haswell for GCC 9. In
>> the meantime, this patch restores both partial_reg_dependency and movx for
>> Haswell in GCC 8.
>>
>> On Haswell, improvements for EEMBC benchmarks with
>>
>> -mtune-ctrl=movx,partial_reg_dependency -Ofast -march=haswell
>>
>> vs
>>
>> -Ofast -mtune=haswell
>>
>> are
>>
>> automotive
>> =========
>> aifftr01 (default) - goodperf: Runtime improvement of 2.6% (time).
>> aiifft01 (default) - goodperf: Runtime improvement of 2.2% (time).
>>
>> networking
>> =========
>> ip_pktcheckb1m (default) - goodperf: Runtime improvement of 3.8% (time).
>> ip_pktcheckb2m (default) - goodperf: Runtime improvement of 5.2% (time).
>> ip_pktcheckb4m (default) - goodperf: Runtime improvement of 4.4% (time).
>> ip_pktcheckb512k (default) - goodperf: Runtime improvement of 4.2% (time).
>>
>> telecom
>> =========
>> fft00data_1 (default) - goodperf: Runtime improvement of 8.4% (time).
>> fft00data_2 (default) - goodperf: Runtime improvement of 8.6% (time).
>> fft00data_3 (default) - goodperf: Runtime improvement of 9.0% (time).
>
> Thanks for the data. Why did you commit the patch to the release branch only?
> The patch is OK for mainline too.
I am checking this patch into trunk now.
> I do not have access to the benchmark so I cannot check. Why do we get
From the Intel optimization guide, section 3.5.2.4, Partial Register Stalls:

General purpose registers can be accessed in granularities of bytes,
words, and doublewords; 64-bit mode also supports quadword granularity.
Referencing a portion of a register is referred to as a partial register
reference.

A partial register stall happens when an instruction refers to a
register, portions of which were previously modified by other
instructions. For example, partial register stalls occur with a read to
AX while previous instructions stored AL and AH, or a read to EAX while
a previous instruction modified AX.

The delay of a partial register stall is small in processors based on
Intel Core and NetBurst microarchitectures, and in the Pentium M
processor (with CPUID signature family 6, model 13), Intel Core Solo,
and Intel Core Duo processors. Pentium M processors (CPUID signature
with family 6, model 9) and the P6 family incur a large penalty.

Note that in Intel 64 architecture, an update to the lower 32 bits of a
64-bit integer register is architecturally defined to zero extend the
upper 32 bits. While this action may be logically viewed as a 32-bit
update, it is really a 64-bit update (and therefore does not cause a
partial stall).

Referencing partial registers frequently produces code sequences with
either false or real dependencies. Example 3-18 demonstrates a series
of false and real dependencies caused by referencing partial registers.
...
When you want to load from memory to a partial register, consider using
MOVZX or MOVSX to avoid the additional merge micro-op penalty.
We have movx, partial_reg_dependency and partial_reg_stall to deal with
it. movx is always good. But partial_reg_stall is enabled only for i686. We
need to investigate partial_reg_stall vs partial_reg_dependency on Haswell+.
> the improvements here and how does that behave on skylake+?
This is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84413
We are working on it.
> honza
>>
>> PR target/85829
>> * config/i386/x86-tune.def: Re-enable partial_reg_dependency
>> and movx for Haswell.
>> ---
>> gcc/config/i386/x86-tune.def | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
>> index 5649fdcf416..60625668236 100644
>> --- a/gcc/config/i386/x86-tune.def
>> +++ b/gcc/config/i386/x86-tune.def
>> @@ -48,7 +48,7 @@ DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
>> over partial stores. For example preffer MOVZBL or MOVQ to load 8bit
>> value over movb. */
>> DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
>> - m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
>> + m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_HASWELL
>> | m_BONNELL | m_SILVERMONT | m_INTEL
>> | m_KNL | m_KNM | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
>>
>> @@ -84,7 +84,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>> partial dependencies. */
>> DEF_TUNE (X86_TUNE_MOVX, "movx",
>> m_PPRO | m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
>> - | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
>> + | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL | m_HASWELL
>> | m_GEODE | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
>>
>> /* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
>> --
>> 2.17.0
>>
>
--
H.J.
From b76a9074e8919f63934e04083e67371a6090e7a0 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Thu, 17 May 2018 09:52:09 -0700
Subject: [PATCH] x86: Re-enable partial_reg_dependency and movx for Haswell
r254152 disabled partial_reg_dependency and movx for Haswell and newer
Intel processors. r258972 restored them for skylake-avx512. For Haswell,
movx improves performance. But partial_reg_stall may be better than
partial_reg_dependency in theory. We will investigate performance impact
of partial_reg_stall vs partial_reg_dependency on Haswell for GCC 9. In
the meantime, this patch restores both partial_reg_dependency and movx for
Haswell in GCC 8.
On Haswell, improvements for EEMBC benchmarks with
-mtune-ctrl=movx,partial_reg_dependency -Ofast -march=haswell
vs
-Ofast -mtune=haswell
are
automotive
=========
aifftr01 (default) - goodperf: Runtime improvement of 2.6% (time).
aiifft01 (default) - goodperf: Runtime improvement of 2.2% (time).
networking
=========
ip_pktcheckb1m (default) - goodperf: Runtime improvement of 3.8% (time).
ip_pktcheckb2m (default) - goodperf: Runtime improvement of 5.2% (time).
ip_pktcheckb4m (default) - goodperf: Runtime improvement of 4.4% (time).
ip_pktcheckb512k (default) - goodperf: Runtime improvement of 4.2% (time).
telecom
=========
fft00data_1 (default) - goodperf: Runtime improvement of 8.4% (time).
fft00data_2 (default) - goodperf: Runtime improvement of 8.6% (time).
fft00data_3 (default) - goodperf: Runtime improvement of 9.0% (time).
PR target/85829
* config/i386/x86-tune.def: Re-enable partial_reg_dependency
and movx for Haswell.
---
gcc/config/i386/x86-tune.def | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 77d99340ebe..f95c0701d5d 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -49,7 +49,7 @@ DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
over partial stores. For example preffer MOVZBL or MOVQ to load 8bit
value over movb. */
DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
- m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
+ m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_HASWELL
| m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL
| m_KNL | m_KNM | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
@@ -87,7 +87,7 @@ DEF_TUNE (X86_TUNE_MOVX, "movx",
m_PPRO | m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
| m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_KNL | m_KNM | m_INTEL
| m_GOLDMONT_PLUS | m_GEODE | m_AMD_MULTIPLE | m_SKYLAKE_AVX512
- | m_GENERIC)
+ | m_HASWELL | m_GENERIC)
/* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
full sized loads. */
--
2.17.0