[patch, libfortran] AMD-specific versions of library matmul

Thu May 25 14:11:00 GMT 2017

On 05/25/2017 03:45 AM, Thomas Koenig wrote:
> Hello world,
> 
> the attached patch speeds up the library version of matmul for AMD chips
> by selecting AVX128 instructions and, depending on which instructions
> are supported, either FMA3 (aka FMA) or FMA4.
> 
> Jerry tested this on his AMD systems, and found a speedup vs. the
> current code of around 10%.
> 
> I have been unable to test this on a Ryzen system (the new compile farm
> machines won't accept my login yet).  From the benchmarks I have read,
> this method should also work fairly well on a Ryzen.
> 
> So, OK for trunk?

Yes, OK.  Maybe test Ryzen first?

I just confirmed access to the Ryzen machines so I plan to get set up and test 
there.

Time to start looking under the hood.

cat /proc/cpuinfo gives for flags:

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 
clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 
constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni 
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c 
rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 
3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext 
perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap 
clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv 
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter 
pfthreshold avic overflow_recov succor smca