This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: option -mprfchw on 2 different Opteron cpus


On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm.  In the assembly, I see this:
>>
>> prefetcht0      (%rax)  # ivtmp.1160
>> prefetcht0      304(%rcx)       #
>> prefetcht0      (%rax)  # ivtmp.1160
>
> On AMD processors, the -mprfchw flag is used to enable "3dnowprefetch" ISA support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See "PREFETCH" and
> "PREFETCHW" in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine?
> I believe this ISA is not available on the k8 machine, so when -march=native is added you don't see -mprfchw in the verbose output.

Looks like zero?  This was generated with the cpuid program from
http://www.etallen.com/cpuid.html

CPU 0:
   0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
   0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
   0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
   0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
   0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
   0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000

CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x5 (5)
      stepping id     = 0x8 (8)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count                      = 0x0 (0)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = false
      thermal monitor and clock ctrl         = false
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = false
      hyper-threading / multi-core supported = false
      therm. monitor                         = false
      IA64                                   = false
      pending break event                    = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = false
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = false
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   extended processor signature (0x80000001/eax):
      family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
      model             = 0x5 (5)
      stepping id       = 0x8 (8)
      extended family   = 0x0 (0)
      extended model    = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   extended feature flags (0x80000001/edx):
      x87 FPU on chip                       = true
      virtual-8086 mode enhancement         = true
      debugging extensions                  = true
      page size extensions                  = true
      time stamp counter                    = true
      RDMSR and WRMSR support               = true
      physical address extensions           = true
      machine check exception               = true
      CMPXCHG8B inst.                       = true
      APIC on chip                          = true
      SYSCALL and SYSRET instructions       = true
      memory type range registers           = true
      global paging extension               = true
      machine check architecture            = true
      conditional move/compare instruction  = true
      page attribute table                  = true
      page size extension                   = true
      multiprocessing capable               = false
      no-execute page protection            = true
      AMD multimedia instruction extensions = true
      MMX Technology                        = true
      FXSAVE/FXRSTOR                        = true
      SSE extensions                        = false
      1-GB large page support               = false
      RDTSCP                                = false
      long mode (AA-64)                     = true
      3DNow! instruction extensions         = true
      3DNow! instructions                   = true
   extended brand id (0x80000001/ebx):
      raw             = 0x405 (1029)
      BrandId         = 0x405 (1029)
      BrandTableIndex = 0x10 (16)
      NN              = 0x5 (5)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      CMP Legacy                             = false
      SVM: secure virtual machine            = false
      extended APIC space                    = false
      AltMovCr8                              = false
      LZCNT advanced bit manipulation        = false
      SSE4A support                          = false
      misaligned SSE mode                    = false
      3DNow! PREFETCH/PREFETCHW instructions = false
      OS visible workaround                  = false
      instruction based sampling             = false
      XOP support                            = false
      SKINIT/STGI support                    = false
      watchdog timer support                 = false
      lightweight profiling support          = false
      4-operand FMA instruction              = false
      NodeId MSR C001100C                    = false
      TBM support                            = false
      topology extensions                    = false
   brand = "AMD Opteron(tm) Processor 248"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries     = 0x8 (8)
      instruction associativity = 0xff (255)
      data # entries            = 0x8 (8)
      data associativity        = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries     = 0x20 (32)
      instruction associativity = 0xff (255)
      data # entries            = 0x20 (32)
      data associativity        = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0x2 (2)
      size (Kb)         = 0x40 (64)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0x2 (2)
      size (Kb)         = 0x40 (64)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries     = 0x200 (512)
      instruction associativity = 4-way (4)
      data # entries            = 0x200 (512)
      data associativity        = 4-way (4)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 16-way (8)
      size (Kb)         = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x0 (0)
      lines per tag         = 0x0 (0)
      associativity         = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode      = true
      frequency ID (FID) control     = false
      voltage ID (VID) control       = false
      thermal trip (TTP)             = true
      thermal monitor (TM)           = false
      software thermal control (STC) = false
      100 MHz multiplier control     = false
      hardware P-State control       = false
      TscInvariant                   = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits         = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits   = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize        = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging                 = false
      LBR virtualization            = false
      SVM lock                      = false
      NRIP save                     = false
      MSR based TSC rate control    = false
      VMCB clean bits support       = false
      flush by ASID                 = false
      decode assists                = false
      SSSE3/SSE5 opcode set disable = false
      pause intercept filter        = false
      pause filter threshold        = false
   NASID: number of address space identifiers = 0x0 (0):
   (instruction supported synth):
      CMPXCHG8B                = true
      conditional move/compare = true
      PREFETCH/PREFETCHW       = true
   (multi-processing synth): none
   (multi-processing method): AMD
   (synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248
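
As a cross-check on the dump above, here is a minimal C sketch that decodes the relevant bits from the raw 0x80000001 values. The function names are made up for illustration. The bit positions (ECX bit 8 for 3DNowPrefetch, EDX bit 29 for LM, EDX bit 31 for 3DNow) are my reading of the AMD documentation, and the three-way OR is the support rule the APM gives for PREFETCH/PREFETCHW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return bit `pos` (0..31) of a 32-bit CPUID register value. */
bool cpuid_bit(uint32_t reg, unsigned pos)
{
    assert(pos < 32);
    return ((reg >> pos) & 1u) != 0;
}

/* Fn8000_0001_ECX bit 8: the dedicated 3DNowPrefetch feature flag. */
bool has_3dnowprefetch_flag(uint32_t ecx)
{
    return cpuid_bit(ecx, 8);
}

/* Broader rule: PREFETCH/PREFETCHW are supported when any of
   ECX[3DNowPrefetch] (bit 8), EDX[LM] (bit 29) or EDX[3DNow] (bit 31)
   is set. */
bool prefetchw_supported(uint32_t ecx, uint32_t edx)
{
    return cpuid_bit(ecx, 8) || cpuid_bit(edx, 29) || cpuid_bit(edx, 31);
}
```

With the values from this machine (ecx=0x00000000, edx=0xe1d3fbff), has_3dnowprefetch_flag() is false while prefetchw_supported() is true, which matches the "3DNow! PREFETCH/PREFETCHW instructions = false" and "PREFETCH/PREFETCHW = true" lines in the cpuid output.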

>>
>> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
>> target the older system), I do see it listed in the options in -fverbose-asm.  In
>> the assembly, I see this:
>
> K8 has 3dnow support, and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch):
> https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> So when you add -march=k8, you see -mprfchw listed in the verbose output.
>
>>
>> prefetcht0      (%rax)  # ivtmp.1160
>> prefetcht0      304(%rcx)       #
>> prefetchw       (%rax)  # ivtmp.1160
>>
>> (The third line is the only difference)
>>
>
> This is my guess without seeing the test case: when write prefetching is requested, "prefetchw" is generated.
> The 3dnow (TARGET_3DNOW) ISA has support for it.
>
> (Snip)
> Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
> Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> Fn8000_0001_EDX[3DNow] = 1.
> (Snip)
> Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>
>> In both cases, I'm using gcc 4.9.3.  Which is correct for a k8 Opteron 248?
>>
>> Also, FWIW:
>>
>> 1) The march=native version that uses prefetcht0 is very repeatably faster by
>> about 15% in the particular test case I'm looking at.
>>
>> 2) The compilers in both instances are not just the same version, they are the
>> same compiler binary installed on an NFS mount and shared to both
>> computers.
>
> As per the GCC 4.9.3 source:
>
> (Snip)
> (define_expand "prefetch"
>   [(prefetch (match_operand 0 "address_operand")
>              (match_operand:SI 1 "const_int_operand")
>              (match_operand:SI 2 "const_int_operand"))]
>   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> {
>   bool write = INTVAL (operands[1]) != 0;
>   int locality = INTVAL (operands[2]);
>
>   gcc_assert (IN_RANGE (locality, 0, 3));
>
>   /* Use 3dNOW prefetch in case we are asking for write prefetch not
>      supported by SSE counterpart or the SSE prefetch is not available
>      (K6 machines).  Otherwise use SSE prefetch as it allows specifying
>      of locality.  */
>   if (TARGET_PREFETCHWT1 && write && locality <= 2)
>     operands[2] = const2_rtx;
>   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>     operands[2] = GEN_INT (3);
>   else
>     operands[1] = const0_rtx;
> })
> (Snip)
>
> Write prefetch may be requested (either by the auto prefetcher or by builtins), but with -march=native the check below could have become false:
>    else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> TARGET_PRFCHW is off on native.
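
To make the effect of that condition concrete, here is a simplified C model of the selection logic in the quoted expander. It is only a sketch: struct prefetch_isa and select_prefetch() are hypothetical names, the booleans stand in for the TARGET_* macros, and the mapping from locality to prefetchnta/t2/t1/t0 and from the 3dnow path to prefetch/prefetchw is my assumption about what the matching insn patterns emit, not something shown in the snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the TARGET_* macros tested by the expander. */
struct prefetch_isa {
    bool prefetch_sse;   /* TARGET_PREFETCH_SSE */
    bool prfchw;         /* TARGET_PRFCHW (-mprfchw) */
    bool prefetchwt1;    /* TARGET_PREFETCHWT1 */
};

/* Returns the mnemonic the quoted logic would lead to, or NULL when no
   prefetch ISA is enabled and the expander condition fails. */
const char *select_prefetch(struct prefetch_isa isa, bool write, int locality)
{
    /* Assumed locality -> SSE mnemonic mapping (0 = no temporal locality). */
    static const char *const sse[4] = {
        "prefetchnta", "prefetcht2", "prefetcht1", "prefetcht0"
    };

    assert(locality >= 0 && locality <= 3);

    if (!isa.prefetch_sse && !isa.prfchw && !isa.prefetchwt1)
        return NULL;

    if (isa.prefetchwt1 && write && locality <= 2)
        return "prefetchwt1";
    if (isa.prfchw && (write || !isa.prefetch_sse))
        return write ? "prefetchw" : "prefetch";

    /* Fallback: the write request is dropped and an SSE read prefetch
       with the requested locality is used. */
    return sse[locality];
}
```

With prefetch_sse and prfchw both on (the -march=k8 case), a write prefetch with locality 3 comes out as prefetchw. With prfchw off (the -march=native case on this machine), the same request falls through to prefetcht0, which is the one-line difference between the two assembly listings.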
>
> So there are two issues here.
>
> (1) The ISA flags enabled with -march=k8 are different from those enabled with -march=native on a k8 machine.
> (2) We need to check why the GCC middle end requested a write prefetch for the test case with -march=k8.
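
One way to look into (2) is a small test case that requests a write prefetch explicitly, so the builtin path can be told apart from the auto prefetcher. The sketch below uses GCC's __builtin_prefetch(addr, rw, locality). The function name scale_in_place and the distance PREFETCH_AHEAD are made up for illustration and are not taken from the original test case.

```c
#include <stddef.h>

/* Arbitrary illustrative prefetch distance, in elements. */
#define PREFETCH_AHEAD 16

/* Scales every element in place, requesting a write prefetch a few
   elements ahead of the one being updated. */
void scale_in_place(double *a, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 1 asks for a write prefetch, locality = 3 is the
               highest temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3);
        }
        a[i] *= factor;
    }
}
```

Compiling this to assembly with -O2 -S -fverbose-asm, once with -march=k8 and once with -march=native on the k8 machine, should show whether the write hint survives as prefetchw or is turned into an SSE read prefetch.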
>
> Regards,
> Venkat.

