This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: option -mprfchw on 2 different Opteron cpus
- From: NightStrike <nightstrike at gmail dot com>
- To: "Kumar, Venkataramanan" <Venkataramanan dot Kumar at amd dot com>
- Cc: "Uros Bizjak (ubizjak at gmail dot com)" <ubizjak at gmail dot com>, "lopezibanez at gmail dot com" <lopezibanez at gmail dot com>, Jan Hubicka <hubicka at ucw dot cz>, Jakub Jelinek <jakub at redhat dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Mon, 2 May 2016 13:01:26 -0400
- Subject: Re: option -mprfchw on 2 different Opteron cpus
- Authentication-results: sourceware.org; auth=none
- References: <CAF1jjLsyTdZhRj=3C56uxFgPmEefJ3vvJu8EdnKGPnxHrH_RjQ at mail dot gmail dot com> <CY1PR1201MB1098DD32228B401ABC8DDDB18F790 at CY1PR1201MB1098 dot namprd12 dot prod dot outlook dot com>
On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm. In the assembly, I see this:
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetcht0 (%rax) # ivtmp.1160
>
> On AMD processors, the -mprfchw flag is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See "PREFETCH" and
> "PREFETCHW" in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine?
> I believe this ISA is not available on the k8 machine, so when -march=native is added you don't see -mprfchw in the verbose output.
Looks like zero? This was generated with the cpuid program from
http://www.etallen.com/cpuid.html
CPU 0:
0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
CPU:
vendor_id = "AuthenticAMD"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium 4/Pentium D/Pentium Extreme
Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon
XP-M/Opteron/Sempron/Turion (15)
model = 0x5 (5)
stepping id = 0x8 (8)
extended family = 0x0 (0)
extended model = 0x0 (0)
(simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon
64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x0 (0)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSENTER and SYSEXIT = true
memory type range registers = true
PTE global bit = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
processor serial number = false
CLFLUSH instruction = true
debug store = false
thermal monitor and clock ctrl = false
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
SSE2 extensions = true
self snoop = false
hyper-threading / multi-core supported = false
therm. monitor = false
IA64 = false
pending break event = false
feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = false
PCLMULDQ instruction = false
64-bit debug store = false
MONITOR/MWAIT = false
CPL-qualified debug store = false
VMX: virtual machine extensions = false
SMX: safer mode extensions = false
Enhanced Intel SpeedStep Technology = false
thermal monitor 2 = false
SSSE3 extensions = false
context ID: adaptive or shared L1 data = false
FMA instruction = false
CMPXCHG16B instruction = false
xTPR disable = false
perfmon and debug = false
process context identifiers = false
direct cache access = false
SSE4.1 extensions = false
SSE4.2 extensions = false
extended xAPIC support = false
MOVBE instruction = false
POPCNT instruction = false
time stamp counter deadline = false
AES instruction = false
XSAVE/XSTOR states = false
OS-enabled XSAVE/XSTOR = false
AVX: advanced vector extensions = false
F16C half-precision convert instruction = false
RDRAND instruction = false
hypervisor guest status = false
extended processor signature (0x80000001/eax):
family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
model = 0x5 (5)
stepping id = 0x8 (8)
extended family = 0x0 (0)
extended model = 0x0 (0)
(simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon
64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
extended feature flags (0x80000001/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSCALL and SYSRET instructions = true
memory type range registers = true
global paging extension = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
multiprocessing capable = false
no-execute page protection = true
AMD multimedia instruction extensions = true
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = false
1-GB large page support = false
RDTSCP = false
long mode (AA-64) = true
3DNow! instruction extensions = true
3DNow! instructions = true
extended brand id (0x80000001/ebx):
raw = 0x405 (1029)
BrandId = 0x405 (1029)
BrandTableIndex = 0x10 (16)
NN = 0x5 (5)
AMD feature flags (0x80000001/ecx):
LAHF/SAHF supported in 64-bit mode = false
CMP Legacy = false
SVM: secure virtual machine = false
extended APIC space = false
AltMovCr8 = false
LZCNT advanced bit manipulation = false
SSE4A support = false
misaligned SSE mode = false
3DNow! PREFETCH/PREFETCHW instructions = false
OS visible workaround = false
instruction based sampling = false
XOP support = false
SKINIT/STGI support = false
watchdog timer support = false
lightweight profiling support = false
4-operand FMA instruction = false
NodeId MSR C001100C = false
TBM support = false
topology extensions = false
brand = "AMD Opteron(tm) Processor 248"
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
instruction # entries = 0x8 (8)
instruction associativity = 0xff (255)
data # entries = 0x8 (8)
data associativity = 0xff (255)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
instruction # entries = 0x20 (32)
instruction associativity = 0xff (255)
data # entries = 0x20 (32)
data associativity = 0xff (255)
L1 data cache information (0x80000005/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L1 instruction cache information (0x80000005/edx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
instruction # entries = 0x200 (512)
instruction associativity = 4-way (4)
data # entries = 0x200 (512)
data associativity = 4-way (4)
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 16-way (8)
size (Kb) = 0x400 (1024)
L3 cache information (0x80000006/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = L2 off (0)
size (in 512Kb units) = 0x0 (0)
Advanced Power Management Features (0x80000007/edx):
temperature sensing diode = true
frequency ID (FID) control = false
voltage ID (VID) control = false
thermal trip (TTP) = true
thermal monitor (TM) = false
software thermal control (STC) = false
100 MHz multiplier control = false
hardware P-State control = false
TscInvariant = false
Physical Address and Linear Address Size (0x80000008/eax):
maximum physical address bits = 0x28 (40)
maximum linear (virtual) address bits = 0x30 (48)
maximum guest physical address bits = 0x0 (0)
Logical CPU cores (0x80000008/ecx):
number of CPU cores - 1 = 0x0 (0)
ApicIdCoreIdSize = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/eax):
SvmRev: SVM revision = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/edx):
nested paging = false
LBR virtualization = false
SVM lock = false
NRIP save = false
MSR based TSC rate control = false
VMCB clean bits support = false
flush by ASID = false
decode assists = false
SSSE3/SSE5 opcode set disable = false
pause intercept filter = false
pause filter threshold = false
NASID: number of address space identifiers = 0x0 (0):
(instruction supported synth):
CMPXCHG8B = true
conditional move/compare = true
PREFETCH/PREFETCHW = true
(multi-processing synth): none
(multi-processing method): AMD
(synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248
>>
>> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
>> target the older system), I do see it listed in the options in -fverbose-asm. In
>> the assembly, I see this:
>
> K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch):
> https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> So when you add -march=k8, you see -mprfchw listed in the verbose output.
>
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetchw (%rax) # ivtmp.1160
>>
>> (The third line is the only difference)
>>
>
> This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
> The 3dnow (TARGET_3DNOW) ISA has support for it.
>
> (Snip)
> Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
> Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> Fn8000_0001_EDX[3DNow] = 1.
> (Snip)
> Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
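
The rule quoted above can be applied by hand to the raw 0x80000001 values from the Opteron 248 dump (ecx=0x00000000, edx=0xe1d3fbff). Below is a minimal sketch in C. The function and macro names are made up for illustration, and the bit positions (ECX bit 8 for 3DNowPrefetch, EDX bit 29 for LM, EDX bit 31 for 3DNow) are taken from AMD's CPUID documentation rather than from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID Fn8000_0001 (assumed from AMD's CPUID spec). */
#define ECX_3DNOWPREFETCH (UINT32_C(1) << 8)   /* ECX bit 8            */
#define EDX_LM            (UINT32_C(1) << 29)  /* EDX bit 29: long mode */
#define EDX_3DNOW         (UINT32_C(1) << 31)  /* EDX bit 31: 3DNow!    */

/* True if the dedicated 3DNowPrefetch feature flag is set. */
bool has_3dnowprefetch_flag(uint32_t ecx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0;
}

/* True if PREFETCH/PREFETCHW are supported per the APM rule:
   ECX[3DNowPrefetch] OR EDX[LM] OR EDX[3DNow]. */
bool prefetchw_supported(uint32_t ecx, uint32_t edx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0
        || (edx & (EDX_LM | EDX_3DNOW)) != 0;
}
```

With the dump's values, has_3dnowprefetch_flag(0x00000000) is false while prefetchw_supported(0x00000000, 0xe1d3fbff) is true. That would be consistent with the cpuid tool reporting "3DNow! PREFETCH/PREFETCHW instructions = false" under the ECX flags but "PREFETCH/PREFETCHW = true" in its synthesized instruction summary.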
>
>> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>>
>> Also, FWIW:
>>
>> 1) The march=native version that uses prefetcht0 is very repeatably faster by
>> about 15% in the particular test case I'm looking at.
>>
>> 2) The compilers in both instances are not just the same version, they are the
>> same compiler binary installed on an NFS mount and shared to both
>> computers.
>
> As per the GCC 4.9.3 source:
>
> (Snip)
> (define_expand "prefetch"
>   [(prefetch (match_operand 0 "address_operand")
>              (match_operand:SI 1 "const_int_operand")
>              (match_operand:SI 2 "const_int_operand"))]
>   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> {
>   bool write = INTVAL (operands[1]) != 0;
>   int locality = INTVAL (operands[2]);
>
>   gcc_assert (IN_RANGE (locality, 0, 3));
>
>   /* Use 3dNOW prefetch in case we are asking for write prefetch not
>      supported by SSE counterpart or the SSE prefetch is not available
>      (K6 machines). Otherwise use SSE prefetch as it allows specifying
>      of locality. */
>   if (TARGET_PREFETCHWT1 && write && locality <= 2)
>     operands[2] = const2_rtx;
>   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>     operands[2] = GEN_INT (3);
>   else
>     operands[1] = const0_rtx;
> })
> (Snip)
>
> Write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:
> else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> because TARGET_PRFCHW is off on native.
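
To make the branch selection easier to follow, here is a simplified C model of the expander's three-way decision. It is only a sketch: the struct, enum, and function names are hypothetical, the boolean fields stand in for the TARGET_* macros, and the mapping of locality 0..3 to prefetchnta/prefetcht2/prefetcht1/prefetcht0 is my understanding of the SSE prefetch pattern in i386.md, not something quoted in this thread. It also assumes the expander's own condition holds and that locality is already in the range 0..3.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the TARGET_* macros. */
struct isa_flags {
    bool prefetch_sse;  /* TARGET_PREFETCH_SSE */
    bool prfchw;        /* TARGET_PRFCHW       */
    bool prefetchwt1;   /* TARGET_PREFETCHWT1  */
};

/* The first four values deliberately equal the locality hint 0..3. */
enum prefetch_insn {
    PF_NTA = 0, PF_T2 = 1, PF_T1 = 2, PF_T0 = 3,
    PF_3DNOW_READ,   /* "prefetch"    */
    PF_3DNOW_WRITE,  /* "prefetchw"   */
    PF_WT1           /* "prefetchwt1" */
};

/* Mirrors the if / else if / else chain of the quoted expander. */
enum prefetch_insn model_prefetch(struct isa_flags f, bool write, int locality)
{
    if (f.prefetchwt1 && write && locality <= 2)
        return PF_WT1;
    if (f.prfchw && (write || !f.prefetch_sse))
        return write ? PF_3DNOW_WRITE : PF_3DNOW_READ;
    /* Fallback: the write request is dropped (operands[1] = const0_rtx)
       and an SSE prefetch with the requested locality is used. */
    return (enum prefetch_insn) locality;
}
```

Under this model, a write prefetch with locality 3 yields PF_3DNOW_WRITE when prfchw is on (the -march=k8 case) and PF_T0 when it is off (the -march=native case on the Opteron 248), while read prefetches yield PF_T0 in both cases as long as SSE prefetch is available. That matches the one-line difference in the two assembly listings.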
>
> So there are two issues here.
>
> (1) The ISA flags enabled with -march=k8 are different from those enabled with -march=native on a k8 machine.
> (2) Need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.
>
> Regards,
> Venkat.
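
For anyone wanting to poke at issue (2) without the original test case, a write prefetch can be requested explicitly with GCC's __builtin_prefetch (second argument 1 means write, third argument is the locality hint 0..3). The function below is a hypothetical reproducer, not the code from this thread, and the prefetch distance of 16 elements is an arbitrary choice.

```c
#include <stddef.h>

/* Hypothetical reproducer: scales an array in place while explicitly
   requesting write prefetches a few elements ahead. */
void scale_in_place(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++) {
        /* rw = 1 asks for a write prefetch, locality = 3 asks to keep
           the line in all cache levels. Guarded so the address stays
           inside the array. */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 1, 3);
        a[i] *= k;
    }
}
```

Compiling this with -S -fverbose-asm, once with -march=k8 and once with -march=k8 -mno-prfchw, should show whether the explicit write request turns into prefetchw or is downgraded to an SSE prefetch. If the original test case contains no builtins, the write prefetch presumably comes from the loop array prefetcher (-fprefetch-loop-arrays) instead.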