Now that a patch for PR51119 is in, we can think about inserting processor-specific versions. target_clones looks like a good fit for this, see https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Common-Function-Attributes.html

There are still a few issues to be resolved, for example which architectures to choose. Also, selecting an architecture which the compiler/assembler on the platform does not support leads to errors, so we probably need to guard with appropriate #ifdefs. A wrapper function that calls the actual matmul is probably a good idea, because it is the caller that generates the selection code.
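For illustration, a minimal sketch of what a target_clones-based kernel could look like. The function name and signature here are made up for the example (they are not libgfortran's actual interface), and this relies on GCC's ifunc support (glibc on x86):

```c
/* Hypothetical sketch: GCC emits one clone of this function per listed
   target plus an ifunc resolver that picks the best supported one at
   load time.  Requires a platform with ifunc support (e.g. x86/glibc).  */
__attribute__((target_clones("avx2", "avx", "default")))
void matmul_sketch(double *c, const double *a, const double *b, int n)
{
    /* Naive row-major C = A * B, just to give the clones a body.  */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
```

The caller never sees the clones; the resolver runs once, so there is no per-call dispatch cost beyond the indirect call.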
Created attachment 40074 [details] Test program for benchmarks
Here are some measurements with the AVX-enabling patch. They were done on an AVX machine, namely gcc75 from the compile farm, with the command line

gfortran -static-libgfortran -finline-matmul-limit=0 -Ofast -o compare_mavx compare_2.f90

Unconditionally setting -mavx in the Makefile for matmul, with stock trunk:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.067    0.077    0.051    0.069
   3   5000    0.193    0.218    0.157    0.194
   4   5000    0.429    0.423    0.368    0.435
   5   5000    0.609    0.659    0.556    0.630
   7   5000    0.948    1.018    0.931    1.009
   8   5000    1.608    1.251    1.589    1.715
   9   5000    1.755    1.484    1.745    1.856
  15   5000    2.710    2.175    2.963    3.105
  16   5000    4.289    2.510    4.541    4.784
  17   5000    4.411    3.032    4.675    4.888
  31   5000    6.165    4.395    6.912    6.902
  32   5000    8.800    4.362    8.793    8.809
  33   5000    8.156    4.463    8.145    8.193
  63   5000    9.727    4.364    9.709    9.716
  64   5000   11.828    4.023   11.810   11.798
  65   5000   10.726    4.489   10.654   10.725
 127   3920   12.144    4.292   12.281   12.268
 128   3829   13.829    4.484   13.807   13.841
 129   3741   12.986    4.438   12.964   12.985
 255    483   14.446    4.571   14.462   14.442
 256    477   15.738    4.707   15.744   15.738
 257    472   13.981    4.565   13.995   13.990
 511     60   14.954    4.674   14.977   14.933
 512     59   16.120    4.840   16.137   16.062
 513     59   14.488    4.392   14.497   14.490
1023      7   15.011    3.573   15.021   14.995
1024      7   15.938    3.489   15.947   15.938
1025      7   14.670    3.568   14.683   14.627

With library-side switching (https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01810.html):

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.067    0.080    0.053    0.067
   3   5000    0.192    0.226    0.159    0.192
   4   5000    0.427    0.436    0.364    0.431
   5   5000    0.588    0.664    0.543    0.621
   7   5000    0.938    0.914    0.926    1.011
   8   5000    1.589    1.235    1.558    1.671
   9   5000    1.704    1.486    1.694    1.810
  15   5000    2.638    2.175    2.854    3.031
  16   5000    4.234    2.532    4.533    4.745
  17   5000    4.374    3.044    4.677    4.839
  31   5000    6.207    4.401    6.891    6.918
  32   5000    8.824    4.364    8.614    8.603
  33   5000    7.954    4.349    7.945    7.944
  63   5000    8.802    4.369    9.728    9.764
  64   5000   11.845    4.025   11.783   11.849
  65   5000   10.753    4.595   10.719   10.753
 127   3920   12.023    4.314   12.285   12.004
 128   3829   13.427    4.369   13.722   13.742
 129   3741   12.877    4.323   12.668   12.985
 255    483   14.398    4.453   14.336   13.496
 256    477   15.708    4.680   15.711   15.465
 257    472   13.977    4.439   13.965   13.977
 511     60   14.920    4.691   14.937   14.939
 512     59   15.959    4.787   16.084   16.082
 513     59   14.444    4.636   14.464   14.452
1023      7   14.978    3.448   14.979   14.980
1024      7   15.903    3.640   15.900   15.905
1025      7   14.638    3.464   14.626   14.636

With stock trunk:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.072    0.078    0.053    0.072
   3   5000    0.199    0.224    0.165    0.200
   4   5000    0.458    0.403    0.387    0.462
   5   5000    0.629    0.661    0.563    0.651
   7   5000    1.073    1.010    1.029    1.131
   8   5000    1.671    1.234    1.637    1.760
   9   5000    1.732    1.465    1.720    1.829
  15   5000    2.895    2.152    3.195    3.349
  16   5000    3.870    2.483    4.168    4.318
  17   5000    3.976    3.029    4.253    4.424
  31   5000    6.210    4.403    6.861    6.868
  32   5000    7.551    4.293    7.544    7.509
  33   5000    7.119    4.418    7.094    7.090
  63   5000    8.742    4.377    8.753    8.728
  64   5000    9.415    4.019    9.384    9.260
  65   5000    8.882    4.540    8.842    8.856
 127   3920   10.073    4.432    9.966    9.988
 128   3829   10.556    4.469   10.552   10.405
 129   3741    9.923    4.428    9.990    9.930
 255    483   10.827    4.569   10.875   10.768
 256    477   11.328    4.705   11.281   11.129
 257    472   10.402    4.492   10.344   10.360
 511     60   10.947    4.674   11.003   10.938
 512     59   11.503    4.842   11.504   11.314
 513     59   10.654    4.672   10.651   10.619
1023      7   10.941    3.641   10.944   10.863
1024      7   11.370    3.587   11.261   11.193
1025      7   10.734    3.601   10.652   10.704

With inlined, -Ofast without -mavx:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    8.979    0.078    0.154    0.241
   3   5000   14.042    0.224    0.348    0.451
   4   5000    1.686    0.435    0.500    0.707
   5   5000    1.989    0.617    0.577    0.829
   7   5000    2.163    0.846    0.783    1.123
   8   5000    3.742    1.224    0.879    1.322
   9   5000    2.764    1.420    0.996    1.458
  15   5000    3.461    2.108    1.305    2.420
  16   5000    4.395    2.589    1.619    2.901
  17   5000    5.238    3.291    1.934    3.579
  31   5000    7.207    4.434    2.347    4.385
  32   5000    7.318    4.306    2.351    4.329
  33   5000    7.204    4.466    2.052    4.421
  63   5000    4.688    4.365    2.486    4.700
  64   5000    4.246    4.022    2.480    4.664
  65   5000    4.238    4.355    2.486    4.703
 127   3920    4.411    4.427    2.821    4.340
 128   3829    4.365    4.481    2.846    4.434
 129   3741    4.427    4.441    2.828    4.396
 255    483    4.561    4.569    2.972    4.517
 256    477    4.666    4.701    2.905    4.685
 257    472    4.520    4.573    2.974    4.550
 511     60    4.669    4.675    3.075    4.666
 512     59    4.823    4.843    3.095    4.835
 513     59    4.655    4.672    3.077    4.651
1023      7    3.555    3.563    2.718    3.554
1024      7    3.519    3.529    2.713    3.519
1025      7    3.527    3.543    2.715    3.536

With inline version with -mavx:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    8.990    0.074    0.155    0.206
   3   5000    7.488    0.212    0.304    0.396
   4   5000    1.773    0.342    0.501    0.533
   5   5000    2.000    0.552    0.615    0.739
   7   5000    2.163    0.919    0.807    1.057
   8   5000    3.369    1.388    0.905    1.578
   9   5000    2.694    1.347    1.020    1.492
  15   5000    3.441    2.201    1.325    2.631
  16   5000    1.831    3.399    1.677    4.137
  17   5000    4.554    3.461    1.976    4.120
  31   5000    7.111    5.286    2.372    5.712
  32   5000    8.384    5.887    2.040    6.725
  33   5000    7.218    5.374    2.057    5.798
  63   5000    8.131    6.107    2.477    6.418
  64   5000    8.707    6.518    2.313    7.228
  65   5000    7.768    6.003    2.427    4.503
 127   3920    6.714    5.688    2.761    6.293
 128   3829    7.067    6.688    2.777    6.880
 129   3741    6.277    6.023    2.765    6.296
 255    483    6.036    5.681    2.877    5.765
 256    477    6.177    5.869    2.921    5.917
 257    472    6.017    5.687    2.880    5.766
 511     60    6.156    5.878    2.848    5.920
 512     59    6.338    6.107    3.026    6.092
 513     59    6.125    5.826    2.954    5.817
1023      7    4.130    4.111    2.623    4.104
1024      7    4.270    4.219    2.667    4.198
1025      7    4.206    4.159    2.616    4.149
I did apply your second patch. I do not get any improvement, and results are diminished from current trunk, so I am missing something. This is the same machine I used for the results in PR 51119. It does have AVX:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

$ gfc -static-libgfortran -finline-matmul-limit=0 -Ofast -o compare_mavx compare.f90
$ ./a.out
=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   2000    5.043    0.045    0.091    0.150
   4   2000    1.417    0.235    0.353    0.325
   8   2000    2.016    0.634    0.862    2.021
  16   2000    5.332    2.834    2.239    2.929
  32   2000    6.169    3.496    1.931    3.289
  64   2000    2.656    2.836    2.655    2.657
 128   2000    2.898    3.286    2.901    2.901
 256    477    3.157    3.429    3.156    3.157
 512     59    3.082    2.356    3.133    3.126
1024      7    3.102    1.363    3.144    3.136
2048      1    3.099    1.685    3.144    3.140
(In reply to Jerry DeLisle from comment #3) > I did apply your second patch: > > I do not get any improvement and results are diminished from current trunk, > so I am missing something. This is same machine I used showing results in > 51119. It does have avx. You have AMD processor, can you try with -mprefer-avx128 option?
(In reply to Uroš Bizjak from comment #4) > (In reply to Jerry DeLisle from comment #3) > > I did apply your second patch: > > > > I do not get any improvement and results are diminished from current trunk, > > so I am missing something. This is same machine I used showing results in > > 51119. It does have avx. > > You have AMD processor, can you try with -mprefer-avx128 option? You may notice I was invoking the wrong executable in what I posted in comment #3. I did rerun the correct one several times and tried it with -mavx -mprefer-avx128. I get the same poor results regardless.
> You may notice I was invoking the wrong executable in what I posted in
> comment #3. I did rerun the correct one several times and tried it with
> -mavx -mprefer-avx128. I get the same poor results regardless.

Several things could go wrong here...

If you run the benchmark under gdb and break, then type "disassemble $pc,$pc+200", do you actually end up in the right program part (the one with AVX instructions)?

Or does your machine prefer AVX128? To find out, what are the timings for inline code using

  -mavx -Ofast

  -mavx -mprefer-avx128 -Ofast

?
And one more thing. Comparing the timing you get for the version with the target_clones attribute against a version with just -mavx added to the relevant line in the Makefile, do you see a difference?
(In reply to Thomas Koenig from comment #6)
> > You may notice I was invoking the wrong executable in what I posted in
> > comment #3. I did rerun the correct one several times and tried it with
> > -mavx -mprefer-avx128. I get the same poor results regardless.
> 
> Several things could go wrong here...
> 
> If you run the benchmark under gdb and break, then type
> "disassemble $pc,$pc+200", do you actually end up in the right
> program part (the one with AVX instructions)?

452           f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
(gdb) disassemble $pc,$pc+200
Dump of assembler code from 0x7ffff7af3554 to 0x7ffff7af361c:
=> 0x00007ffff7af3554 <aux_matmul_r8+5220>:  vaddpd %ymm12,%ymm4,%ymm4
   0x00007ffff7af3559 <aux_matmul_r8+5225>:  vmulpd %ymm10,%ymm15,%ymm12
   0x00007ffff7af355e <aux_matmul_r8+5230>:  vaddpd %ymm11,%ymm5,%ymm5
   0x00007ffff7af3563 <aux_matmul_r8+5235>:  vmulpd %ymm14,%ymm15,%ymm15
   0x00007ffff7af3568 <aux_matmul_r8+5240>:  vmulpd %ymm10,%ymm13,%ymm10
   0x00007ffff7af356d <aux_matmul_r8+5245>:  vaddpd %ymm12,%ymm6,%ymm6
   0x00007ffff7af3572 <aux_matmul_r8+5250>:  vmulpd %ymm14,%ymm13,%ymm14
   0x00007ffff7af3577 <aux_matmul_r8+5255>:  vaddpd %ymm15,%ymm8,%ymm8
   0x00007ffff7af357c <aux_matmul_r8+5260>:  vaddpd %ymm10,%ymm7,%ymm7
   0x00007ffff7af3581 <aux_matmul_r8+5265>:  vaddpd %ymm14,%ymm9,%ymm9
   0x00007ffff7af3586 <aux_matmul_r8+5270>:  ja 0x7ffff7af3433 <aux_matmul_r8+4931>
   0x00007ffff7af358c <aux_matmul_r8+5276>:  mov -0x801f8(%rbp),%rdx
   0x00007ffff7af3593 <aux_matmul_r8+5283>:  vhaddpd %ymm9,%ymm9,%ymm13
   0x00007ffff7af3598 <aux_matmul_r8+5288>:  vhaddpd %ymm8,%ymm8,%ymm15
   0x00007ffff7af359d <aux_matmul_r8+5293>:  vhaddpd %ymm7,%ymm7,%ymm7
   0x00007ffff7af35a1 <aux_matmul_r8+5297>:  vperm2f128 $0x1,%ymm13,%ymm13,%ymm11
   0x00007ffff7af35a7 <aux_matmul_r8+5303>:  vhaddpd %ymm5,%ymm5,%ymm5
   0x00007ffff7af35ab <aux_matmul_r8+5307>:  vperm2f128 $0x1,%ymm15,%ymm15,%ymm8
   0x00007ffff7af35b1 <aux_matmul_r8+5313>:  vaddpd %ymm11,%ymm13,%ymm12
   0x00007ffff7af35b6 <aux_matmul_r8+5318>:  vperm2f128 $0x1,%ymm7,%ymm7,%ymm13
   0x00007ffff7af35bc <aux_matmul_r8+5324>:  vaddpd %ymm8,%ymm15,%ymm14
   0x00007ffff7af35c1 <aux_matmul_r8+5329>:  vhaddpd %ymm6,%ymm6,%ymm6
   0x00007ffff7af35c5 <aux_matmul_r8+5333>:  vaddsd -0x80068(%rbp),%xmm12,%xmm10
   0x00007ffff7af35cd <aux_matmul_r8+5341>:  vaddsd -0x80070(%rbp),%xmm14,%xmm9
   0x00007ffff7af35d5 <aux_matmul_r8+5349>:  vperm2f128 $0x1,%ymm5,%ymm5,%ymm14
   0x00007ffff7af35db <aux_matmul_r8+5355>:  vhaddpd %ymm4,%ymm4,%ymm4
   0x00007ffff7af35df <aux_matmul_r8+5359>:  vaddpd %ymm13,%ymm7,%ymm11
   0x00007ffff7af35e4 <aux_matmul_r8+5364>:  vmovsd %xmm10,-0x80068(%rbp)
   0x00007ffff7af35ec <aux_matmul_r8+5372>:  vperm2f128 $0x1,%ymm6,%ymm6,%ymm10
   0x00007ffff7af35f2 <aux_matmul_r8+5378>:  vperm2f128 $0x1,%ymm4,%ymm4,%ymm13
   0x00007ffff7af35f8 <aux_matmul_r8+5384>:  vmovsd %xmm9,-0x80070(%rbp)
   0x00007ffff7af3600 <aux_matmul_r8+5392>:  vaddpd %ymm14,%ymm5,%ymm9
   0x00007ffff7af3605 <aux_matmul_r8+5397>:  vhaddpd %ymm0,%ymm0,%ymm0
   0x00007ffff7af3609 <aux_matmul_r8+5401>:  vaddsd -0x80058(%rbp),%xmm11,%xmm12
   0x00007ffff7af3611 <aux_matmul_r8+5409>:  vaddpd %ymm10,%ymm6,%ymm15
   0x00007ffff7af3616 <aux_matmul_r8+5414>:  vaddpd %ymm13,%ymm4,%ymm11
   0x00007ffff7af361b <aux_matmul_r8+5419>:  vperm2f128 $0x1,%ymm0,%ymm0,%ymm13
End of assembler dump.

> Or does your machine prefer AVX128?
> 
> To find out, what are the timings for inline code using
> 
>   -mavx -Ofast
> 
>   -mavx -mprefer-avx128 -Ofast
> 
> ?
$ gfc -finline-matmul-limit=64 -Ofast compare.f90
$ ./a.out
=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   2000    4.933    0.045    0.086    0.144
   4   2000    1.418    0.225    0.271    0.347
   8   2000    2.168    0.616    1.296    1.830
  16   2000    5.330    2.824    1.784    2.907
  32   2000    6.239    3.488    1.446    3.406
  64   2000    2.650    2.746    1.552    2.691

$ gfc -finline-matmul-limit=64 -mavx -Ofast compare.f90
$ ./a.out
=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   2000    6.934    0.042    0.091    0.134
   4   2000    1.320    0.181    0.365    0.252
   8   2000    1.007    0.446    1.595    0.982
  16   2000    0.581    1.163    2.411    1.180
  32   2000    1.346    1.276    2.061    1.277
  64   2000    1.397    1.327    2.288    1.328

$ gfc -finline-matmul-limit=64 -mavx -mprefer-avx128 -Ofast compare.f90
$ ./a.out
=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   2000    5.021    0.045    0.088    0.139
   4   2000    1.607    0.202    0.288    0.341
   8   2000    2.482    0.575    0.743    1.861
  16   2000    5.674    2.804    1.809    2.792
  32   2000    6.323    3.460    1.478    3.293
  64   2000    2.714    2.832    1.582    2.694

If I put -mavx -mprefer-avx128 in the Makefile.am, I get results as good as or better than without your patch. I also see that none of the HAVE_AVX* macros are defined in config.
$ gfc -finline-matmul-limit=0 -Ofast compare.f90
$ ./a.out
=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   2000    0.043    0.041    0.034    0.043
   4   2000    0.272    0.234    0.223    0.256
   8   2000    0.835    1.687    1.627    1.709
  16   2000    2.886    2.887    2.859    2.869
  32   2000    4.733    3.494    4.755    4.652
  64   2000    6.933    2.837    6.933    6.877
 128   2000    7.949    3.285    8.705    7.914
 256    477   10.040    3.447    9.999    9.951
 512     59    8.885    2.341    8.923    8.940
1024      7    8.937    1.367    8.978    8.991
2048      1    8.799    1.672    8.831    8.854

The following is in config.h.in, for what it is worth:

/* Define if AVX instructions can be compiled. */
#undef HAVE_AVX

/* Define if AVX2 instructions can be compiled. */
#undef HAVE_AVX2

/* Define if AVX512f instructions can be compiled. */
#undef HAVE_AVX512F
Next question - what happens if you add -mvzeroupper -mavx to the line in the Makefile? Does that make a difference in speed?
(In reply to Thomas Koenig from comment #9)
> Next question - what happens if you add
> 
> -mvzeroupper -mavx
> 
> to the line in the Makefile? Does that make a difference in speed?

-mvzeroupper slows everything way down, with or without -mprefer-avx128.
One could consider running a reference matrix multiply of size 32 in a loop and doing timing tests to determine whether to use -mprefer-avx128. On this machine, from comment 8:

mavx = 1.276    mavx mprefer-avx128 = 3.460

There is some margin there for a fairly good test. Or is there another way to tell?
(In reply to Jerry DeLisle from comment #11)
> One could consider running a reference matrix multiply of size 32 in a loop
> and do timing tests to determine whether to use -mprefer-avx128. On this
> machine from comment 8
> 
> mavx = 1.276    mavx mprefer-avx128 = 3.460
> 
> There is some margin there for a fairly good test. Or is there another way
> to tell?

I read some advice on the net that certain types of AMD processors have AVX, but for which AVX128 is better.

What exactly is your CPU model? What does /proc/cpuinfo say?

gcc determines the CPU model (see trunk/libgcc/config/i386/cpuinfo.c). We should be able to query the CPU model and dispatch for AVX128 or AVX (or the other variants) based on that.
OK, I think I have a rough idea how to do this.

For querying the CPU model, we need to put the interface from libgcc/config/i386/cpuinfo.c into a separate header. Then we generate a list of matmul functions using m4, with a second parameter giving the architecture, as in

$(M4) -Dfile=$@ -Darch=avx512f ...

In the generated C files, we enclose the whole content in #ifdef HAVE_AVX512F, so nothing happens if the architecture is not supported by the compiler. The target attribute is also set there.

On the first call to matmul, we check for the availability of AVX etc.; we also check for preferences such as AVX128 from the CPU model, and then set a static function pointer to the function we want to call. On each subsequent invocation, all we do is that (tail) call.

How does this sound?
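A rough sketch of that first-call dispatch scheme, with all names hypothetical (the real code would live in the m4-generated files, and HAVE_AVX would come from configure):

```c
#include <stddef.h>

typedef void (*matmul_fn)(double *, const double *, const double *, int);

/* Plain version, always available.  */
static void matmul_vanilla(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

#ifdef HAVE_AVX
/* Same kernel, compiled with AVX enabled via the target attribute;
   guarded so nothing is emitted if the toolchain lacks AVX support.  */
__attribute__((target("avx")))
static void matmul_avx(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
#endif

static matmul_fn matmul_impl;  /* selected on the first call */

void matmul_r8(double *c, const double *a, const double *b, int n)
{
    if (__builtin_expect(matmul_impl == NULL, 0)) {
        matmul_impl = matmul_vanilla;
#ifdef HAVE_AVX
        /* Runtime CPU check; only compiled in when AVX is available.  */
        if (__builtin_cpu_supports("avx"))
            matmul_impl = matmul_avx;
#endif
    }
    matmul_impl(c, a, b, n);  /* just a (tail) call on later invocations */
}
```

The one-time NULL check is essentially free compared to the cost of any non-trivial matrix multiplication, and keeps the Fortran-visible entry point unchanged.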
(In reply to Thomas Koenig from comment #12)
> I read some advice on the net that certain types of AMD processors
> have AVX, but AVX128 is better for them.
> 
> What exactly is your CPU model? What does /proc/cpuinfo say?

I have three different machines here. I am sure they are all similar, as they are A series. The first is the one used for the test results posted here:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 21
model       : 16
model name  : AMD A10-5800K APU with Radeon(tm) HD Graphics
stepping    : 1

2nd:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 18
model       : 1
model name  : AMD A6-3620 APU with Radeon(tm) HD Graphics
stepping    : 0

3rd:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 22
model       : 48
model name  : AMD A8-6410 APU with AMD Radeon R5 Graphics
stepping    : 1

(In reply to Thomas Koenig from comment #13)
> $(M4) -Dfile=$@ -Darch=avx512f ...
> 
> In the generated C files, we enclose the whole content inside HAVE_AVX512F,
> so nothing happens if the architecture is not supported by the compiler.
> The target attribute is also set there.
> 
> On the first call to matmul, we check for the availability of AVX
> etc, we also check for preferences such as AVX128 from the CPU model,
> and then set a static function pointer to the function we want to call.
> On each subsequent invocation, all we do is that (tail) call.
> 
> How does this sound?

This seems a bit complicated. The machines I have do OK without the aux-matmul and without any machine-specific compilation other than the current defaults gcc uses with the flags inside the Makefile on current trunk. Can this be done without the first-call check in matmul?
OMG, the world of processors is more complicated than I thought. So, these rather modern AMD chips support AVX, but are bad at it.

Two questions:

- Can you check if -mfma (FMA3) and/or -mfma4 make any difference?

- If you start any program compiled with -g under the debugger, break anywhere (for example at the beginning of the main program) and do a "p __cpu_model", what do you get?

I am halfway tempted to restrict the AVX* stuff to Intel processors only. At least that way we will not make things worse for AMD processors.
(In reply to Thomas Koenig from comment #15)
> OMG, the world of processors is more complicated than I thought.
> So, these rather modern AMD chips support AVX, but are bad at it.
> 
> Two questions:
> 
> - Can you check if -mfma (FMA3) and/or -mfma4 make any difference?
> 
> - If you start any program compiled with -g under the debugger, break
> anywhere (for example at the beginning of the main program)
> and do a "p __cpu_model", what do you get?

The A10-5800K:

p __cpu_model
$1 = {__cpu_vendor = 2, __cpu_type = 5, __cpu_subtype = 8, __cpu_features = {883711}}

The A8:

p __cpu_model
$2 = {__cpu_vendor = 2, __cpu_type = 9, __cpu_subtype = 0, __cpu_features = {855039}}

The A6:

p __cpu_model
$1 = {__cpu_vendor = 2, __cpu_type = 0, __cpu_subtype = 0, __cpu_features = {2111}}

Neither -mfma nor -mfma4 helps.
On a hunch, this brings it back:

$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4 -march=native

So -march=native fixes it. Not quite as fast as -mprefer-avx128, but close enough.
Created attachment 40119 [details] Version that works (AVX only) Here is a version that should only do AVX stuff on Intel processors. Optimization for other processor types could come later.
Created attachment 40120 [details] Updated patch Well, here's an update also for AVX512F. I can confirm the patch gives the same performance as the AVX version on a machine that supports AVX. Untested on AVX512, because I don't have a machine for that. Adding AVX2 would be fairly trivial. I'm not sure that yanking out the info into the new cpuinfo.h header file is the way to go, but I am not sure of a better way to do it. Other comments?
(In reply to Thomas Koenig from comment #18)
> Created attachment 40119 [details]
> Version that works (AVX only)
> 
> Here is a version that should only do AVX stuff on Intel processors.
> Optimization for other processor types could come later.

This is interesting. This patch works fine on the AMD processors I tested. Looking at the disassembly, the vanilla matmul does use the xmm registers but does not use any vector instructions. Peak with this is about 9.3 gflops. With -mavx and -mprefer-avx128 the peak is 10.0 gflops, or about a 7.5% improvement.

I think we should get this patch committed, and then we can work on the AMD side. I know Steve is running an FX series AMD processor. Once this patch goes in, I will give it a spin there. The FX are clearly better than this generation of APU, which is more focused on the on-chip GPU features (which are pretty good).

We will also want to keep an eye on the Zen-based processors, which I expect will behave more like Intel regarding the vector instructions (well, we will see anyway).
(In reply to Thomas Koenig from comment #19)
> Created attachment 40120 [details]
> Updated patch
> 
> Well, here's an update also for AVX512F.
> 
> I can confirm the patch gives the same performance as the AVX
> version on a machine that supports AVX. Untested on AVX512, because
> I don't have a machine for that.
> 
> Adding AVX2 would be fairly trivial.
> 
> I'm not sure that yanking out the info into the new cpuinfo.h header
> file is the way to go, but I am not sure of a better way to do it.
> 
> Other comments?

I wonder if there is one in the gcc compile farm. Is AVX512 a Knights Landing feature? Which machines have it? (Time to google.)
Author: tkoenig
Date: Sat Dec  3 09:44:35 2016
New Revision: 243219

URL: https://gcc.gnu.org/viewcvs?rev=243219&root=gcc&view=rev
Log:
2016-12-03  Thomas Koenig  <tkoenig@gcc.gnu.org>

    PR fortran/78379
    * config/i386/cpuinfo.c: Move enums for processor vendors,
    processor type, processor subtypes and declaration of
    struct __processor_model into
    * config/i386/cpuinfo.h: New header file.
    * Makefile.am: Add dependence of m4/matmul_internal.m4 to
    matmul files.
    * Makefile.in: Regenerated.
    * acinclude.m4: Check for AVX, AVX2 and AVX512F.
    * config.h.in: Add HAVE_AVX, HAVE_AVX2 and HAVE_AVX512F.
    * configure: Regenerated.
    * configure.ac: Use checks for AVX, AVX2 and AVX512F.
    * m4/matmul_internal.m4: New file, working part of matmul.m4.
    * m4/matmul.m4: Implement architecture-specific switching
    for AVX, AVX2 and AVX512F by including matmul_internal.m4
    multiple times.
    * generated/matmul_c10.c: Regenerated.
    * generated/matmul_c16.c: Regenerated.
    * generated/matmul_c4.c: Regenerated.
    * generated/matmul_c8.c: Regenerated.
    * generated/matmul_i1.c: Regenerated.
    * generated/matmul_i16.c: Regenerated.
    * generated/matmul_i2.c: Regenerated.
    * generated/matmul_i4.c: Regenerated.
    * generated/matmul_i8.c: Regenerated.
    * generated/matmul_r10.c: Regenerated.
    * generated/matmul_r16.c: Regenerated.
    * generated/matmul_r4.c: Regenerated.
    * generated/matmul_r8.c: Regenerated.

Added:
    trunk/libgcc/config/i386/cpuinfo.h
    trunk/libgfortran/m4/matmul_internal.m4
Modified:
    trunk/libgcc/ChangeLog
    trunk/libgcc/config/i386/cpuinfo.c
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/acinclude.m4
    trunk/libgfortran/config.h.in
    trunk/libgfortran/configure
    trunk/libgfortran/configure.ac
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
Timings before r243219:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.020    0.059    0.140    0.181
   3   5000    0.475    0.551    0.411    0.531
   4   5000    1.011    1.120    0.951    1.131
   5   5000    1.446    1.512    1.286    1.490
   7   5000    2.481    2.323    2.313    2.573
   8   5000    3.511    2.496    3.402    3.678
   9   5000    3.575    2.300    2.074    2.694
  15   5000    4.395    3.242    5.172    5.299
  16   5000    5.907    3.228    5.920    6.009
  17   5000    5.445    3.804    4.681    5.489
  31   5000    7.133    4.291    7.209    7.304
  32   5000    7.984    4.323    7.197    7.580
  33   5000    6.739    4.488    7.306    7.377
  63   5000    8.718    4.682    8.997    9.170
  64   5000    9.667    4.555    9.611    9.882
  65   5000    9.263    4.462    9.018    9.418
 127   3920   10.378    4.287   10.327   10.296
 128   3829   10.960    4.353   10.967   11.138
 129   3741   10.343    4.315   10.065   10.440
 255    483   11.370    4.522   11.511   11.229
 256    477   11.589    4.538   11.841   11.307
 257    472   10.983    4.532   10.721   10.955
 511     60   11.341    4.476   10.970   11.399
 512     59   12.164    4.666   12.257   11.726
 513     59   11.044    4.575   11.141   10.582
1023      7   11.059    3.900   11.374   11.313
1024      7   12.030    3.908   11.773   11.275
1025      7   10.912    3.933   10.598   11.072

At r243219:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.096    0.108    0.098    0.125
   3   5000    0.353    0.411    0.290    0.355
   4   5000    0.779    0.770    0.651    0.846
   5   5000    1.176    1.286    1.088    1.193
   7   5000    2.089    2.260    1.991    2.142
   8   5000    3.232    2.430    3.164    3.486
   9   5000    3.380    2.747    3.370    3.575
  15   5000    4.668    3.018    4.481    4.692
  16   5000    5.184    3.506    5.987    6.404
  17   5000    5.747    3.348    5.596    5.774
  31   5000    6.995    4.036    7.046    7.040
  32   5000    8.822    4.161    7.868    8.076
  33   5000    7.778    4.348    8.078    8.090
  63   5000    9.600    4.509    9.682    9.367
  64   5000   11.616    4.365   11.045   10.845
  65   5000   10.434    4.337   10.536   10.558
 127   3920   11.975    4.259   12.065   11.979
 128   3829   13.767    4.307   12.918   13.469
 129   3741   12.370    4.139   11.410   12.350
 255    483   13.292    4.462   14.016   14.005
 256    477   14.298    4.477   14.312   15.027
 257    472   13.436    4.352   13.014   13.565
 511     60   13.484    4.574   14.024   13.789
 512     59   13.803    4.459   14.284   14.950
 513     59   13.094    4.479   13.069   13.234
1023      7   13.952    3.914   14.194   13.873
1024      7   14.636    3.837   14.675   14.987
1025      7   13.649    3.953   13.594   13.701

For reference, with -fexternal-blas:

=========================================================
================   MEASURED GIGAFLOPS   =================
=========================================================
                Matmul            Matmul fixed  Matmul variable
Size  Loops   explicit refMatmul     assumed        explicit
=========================================================
   2   5000    0.096    0.107    0.091    0.127
   3   5000    0.370    0.411    0.293    0.371
   4   5000    0.812    0.825    0.692    0.812
   5   5000    1.254    1.292    1.117    1.273
   7   5000    2.382    2.345    2.295    2.536
   8   5000    3.483    2.501    2.804    2.192
   9   5000    2.421    2.058    2.574    3.121
  15   5000    5.077    3.244    5.233    5.298
  16   5000    5.797    3.220    5.799    5.762
  17   5000    5.354    2.891    5.287    5.474
  31   5000    9.939    4.311   11.991   12.169
  32   5000   15.715    4.006   15.851   16.007
  33   5000   13.375    4.290   14.441   14.977
  63   5000   18.057    4.683   18.372   17.800
  64   5000   21.426    4.270   20.842   22.123
  65   5000   18.861    4.385   20.410   19.707
 127   3920   21.448    4.288   20.904   21.320
 128   3829   44.731    4.312   44.129   40.524
 129   3741   36.300    4.109   38.858   36.359
 255    483   52.876    4.310   57.982   54.261
 512     59   59.823    4.688   66.297   60.748
 513     59   58.666    4.559   60.481   57.547
1023      7   61.315    3.900   64.559   61.124
1024      7   63.148    3.861   68.033   62.486
1025      7   58.991    3.895   55.074   58.168
It could be a good idea to add a version with -mfma to the flags for AVX2. I'll see what I can do. It might be too late for gcc 7, and I also don't have an AVX2 machine to test on.

It might also be a good idea to do this for AVX512F (unless FMA is automatically included there).
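As a sketch of the idea (names are illustrative, not the generated libgfortran code): listing both ISA extensions in the target attribute lets GCC use fused multiply-add when vectorizing the accumulation, and the caller is expected to verify CPU support at run time before dispatching here.

```c
/* Hypothetical AVX2+FMA clone.  With "avx2,fma" in effect, GCC may emit
   vfmadd* instructions for the multiply-add in the inner loop.  The
   dispatcher must check __builtin_cpu_supports("avx2") and ("fma")
   before calling this on a given machine.  */
__attribute__((target("avx2,fma")))
void matmul_avx2_fma(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];  /* FMA candidate */
            c[i * n + j] = s;
        }
}
```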
Author: tkoenig
Date: Thu Mar  2 11:04:01 2017
New Revision: 245836

URL: https://gcc.gnu.org/viewcvs?rev=245836&root=gcc&view=rev
Log:
2017-03-02  Thomas Koenig  <tkoenig@gcc.gnu.org>

    PR fortran/78379
    * m4/matmul.m4 (matmul_'rtype_code`_avx2): Also generate for
    reals.  Add fma to target options.
    (matmul_'rtype_code`): Call AVX2 only if FMA is available.
    * generated/matmul_c10.c: Regenerated.
    * generated/matmul_c16.c: Regenerated.
    * generated/matmul_c4.c: Regenerated.
    * generated/matmul_c8.c: Regenerated.
    * generated/matmul_i1.c: Regenerated.
    * generated/matmul_i16.c: Regenerated.
    * generated/matmul_i2.c: Regenerated.
    * generated/matmul_i4.c: Regenerated.
    * generated/matmul_i8.c: Regenerated.
    * generated/matmul_r10.c: Regenerated.
    * generated/matmul_r16.c: Regenerated.
    * generated/matmul_r4.c: Regenerated.
    * generated/matmul_r8.c: Regenerated.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
What is AVX-specific, as opposed to SIMD vector size-specific, about this feature? It seems that this should be enabled for all SIMD architectures of the appropriate width.
(In reply to David Edelsohn from comment #26)
> What is AVX-specific, as opposed to SIMD vector size-specific, about this
> feature? It seems that this should be enabled for all SIMD architectures
> of the appropriate width.

You're right, this could apply just as well to other platforms where SIMD instructions are available but cannot be turned on by default because they are not universally implemented. I would need three pieces of information:

- What to put into the libgfortran config file to check whether the installed binutils support the SIMD extension in question
- How to check at runtime for the specific processor version
- Which options to pass to __attribute__((__target__ ..

Then it is relatively straightforward to put this in.
Because PPC64LE Linux reset the base ISA level, VSX is now enabled by default, so a function clone for VSX probably isn't necessary. Special versions might help AIX and PPC64BE, which have lower ISA defaults, but those are not the focus.
(In reply to David Edelsohn from comment #28)
> Because PPC64LE Linux reset the base ISA level, VSX now is enabled by
> default, so a function clone for VSX probably isn't necessary. While
> special versions might help AIX and PPC64BE, with lower ISA defaults,
> those are not the focus.

What about ARM NEON? Is that part of the normal ISA level by now?
I think there is still one thing to do. Apparently, AMD CPUs (which currently use only the vanilla version) are slightly faster with -mprefer-avx128, and they should be much faster if they have FMA3. Unless I missed something, it is not possible to specify something like -mprefer-avx128 as a target attribute. What would be the best way to go about this?
Created attachment 41405 [details] Patch for AMD Here's a proposed patch for AMD processors. It uses AVX128 plus FMA3 when both are available, or AVX128 plus FMA4, or nothing. The rationale is that AVX128 alone does not do a lot for AMD processors. The new files will come as a separate attachment.
Created attachment 41406 [details] Additional files for the previous patch Here are the new files for the patch.
(In reply to Thomas Koenig from comment #32)
> Created attachment 41406 [details]
> Additional files for the previous patch
>
> Here are the new files for the patch.

Well, I tried to apply the patch and test without using maintainer mode. Running my tests in the debugger, breaking and disassembling shows xmm instructions and calls to matmul_vanilla, so I think I need to enable maintainer mode and rebuild, or something is not quite right. Suggestions?
Created attachment 41410 [details] Patch which has all the files

Well, I suspect my way of splitting the previous patch into one real patch and one *.tar.gz file was not really the best way to go :-)

Here is a patch which should include all the new files. At least it fits into the 1000 kB limit.
(In reply to Thomas Koenig from comment #34)
> Created attachment 41410 [details]
> Patch which has all the files
>
> Well, I suspect my way of splitting the previous patch into
> one real patch and one *.tar.gz file was not really the best way
> to go :-)
>
> Here is a patch which should include all the new files.
>
> At least it fits into the 1000 kb limit.

I am finishing a build in maintainer mode, so I will try the first approach and, if that fails, the new patch. Everything looks reasonable; I just think we should test on my AMD boxes.
Results look very good.

Gfortran 7, no patch, gives:

$ gfc7 -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
=========================================================
================  MEASURED GIGAFLOPS  ===================
=========================================================
               Matmul                Matmul fixed  Matmul variable
 Size  Loops   explicit  refMatmul   assumed       explicit
=========================================================
    2  2000       4.706      0.046       0.094      0.162
    4  2000       1.246      0.246       0.305      0.351
    8  2000       1.410      0.605       0.958      1.791
   16  2000       5.413      2.787       2.228      2.615
   32  2000       4.676      3.416       4.622      4.618
   64  2000       6.368      2.652       6.339      6.167
  128  2000       8.165      2.998       8.118      8.260
  256   477       9.334      3.202       9.248      9.355
  512    59       8.730      2.239       8.596      8.730
 1024     7       8.805      1.378       8.673      8.812
 2048     1       8.781      1.728       8.649      8.789

Latest gfortran trunk with the patch gives:

$ gfc -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
=========================================================
================  MEASURED GIGAFLOPS  ===================
=========================================================
               Matmul                Matmul fixed  Matmul variable
 Size  Loops   explicit  refMatmul   assumed       explicit
=========================================================
    2  2000       4.738      0.048       0.092      0.172
    4  2000       1.438      0.248       0.305      0.378
    8  2000       1.511      0.617       1.177      1.955
   16  2000       5.426      2.810       1.854      2.881
   32  2000       4.688      3.314       4.357      5.091
   64  2000       6.669      2.674       6.629      7.110
  128  2000       9.139      3.000       9.076      9.131
  256   477      10.495      3.184      10.466     10.516
  512    59       9.577      2.189       9.477      9.635
 1024     7       9.593      1.381       9.519      9.658
 2048     1       9.722      1.709       9.625      9.785
Author: tkoenig
Date: Thu May 25 21:51:27 2017
New Revision: 248472

URL: https://gcc.gnu.org/viewcvs?rev=248472&root=gcc&view=rev
Log:
2017-05-25  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR libfortran/78379
	* Makefile.am: Add generated/matmulavx128_*.c files.  Handle
	them for compiling and setting the right flags.
	* acinclude.m4: Add tests for FMA3, FMA4 and AVX128.
	* configure.ac: Call them.
	* Makefile.in: Regenerated.
	* config.h.in: Regenerated.
	* configure: Regenerated.
	* m4/matmul.m4: Handle AMD chips by calling 128-bit AVX
	versions which use FMA3 or FMA4.
	* m4/matmulavx128.m4: New file.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.
	* generated/matmulavx128_c10.c: New file.
	* generated/matmulavx128_c16.c: New file.
	* generated/matmulavx128_c4.c: New file.
	* generated/matmulavx128_c8.c: New file.
	* generated/matmulavx128_i1.c: New file.
	* generated/matmulavx128_i16.c: New file.
	* generated/matmulavx128_i2.c: New file.
	* generated/matmulavx128_i4.c: New file.
	* generated/matmulavx128_i8.c: New file.
	* generated/matmulavx128_r10.c: New file.
	* generated/matmulavx128_r16.c: New file.
	* generated/matmulavx128_r4.c: New file.
	* generated/matmulavx128_r8.c: New file.
Added:
    trunk/libgfortran/generated/matmulavx128_c10.c
    trunk/libgfortran/generated/matmulavx128_c16.c
    trunk/libgfortran/generated/matmulavx128_c4.c
    trunk/libgfortran/generated/matmulavx128_c8.c
    trunk/libgfortran/generated/matmulavx128_i1.c
    trunk/libgfortran/generated/matmulavx128_i16.c
    trunk/libgfortran/generated/matmulavx128_i2.c
    trunk/libgfortran/generated/matmulavx128_i4.c
    trunk/libgfortran/generated/matmulavx128_i8.c
    trunk/libgfortran/generated/matmulavx128_r10.c
    trunk/libgfortran/generated/matmulavx128_r16.c
    trunk/libgfortran/generated/matmulavx128_r4.c
    trunk/libgfortran/generated/matmulavx128_r8.c
    trunk/libgfortran/m4/matmulavx128.m4
Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/acinclude.m4
    trunk/libgfortran/config.h.in
    trunk/libgfortran/configure
    trunk/libgfortran/configure.ac
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
This works for Intel and AMD now. If anybody wants another architecture, we know how to do it. Closing.