Bug 78379 - Processor-specific versions for matmul
Summary: Processor-specific versions for matmul
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libfortran
Version: 7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Thomas Koenig
 
Reported: 2016-11-16 12:32 UTC by Thomas Koenig
Modified: 2017-05-26 05:20 UTC
CC: 3 users

Last reconfirmed: 2016-12-03 00:00:00


Attachments:
- Test program for benchmarks (4.57 KB, text/plain), 2016-11-17 17:52 UTC, Thomas Koenig
- Version that works (AVX only) (16.40 KB, application/x-gzip), 2016-11-22 17:01 UTC, Thomas Koenig
- Updated patch (17.59 KB, patch), 2016-11-22 20:41 UTC, Thomas Koenig
- Patch for AMD (6.33 KB, patch), 2017-05-22 16:44 UTC, Thomas Koenig
- Additional files for the previous patch (18.11 KB, application/gzip), 2017-05-22 16:46 UTC, Thomas Koenig
- Patch which has all the files (28.37 KB, patch), 2017-05-24 06:13 UTC, Thomas Koenig

Description Thomas Koenig 2016-11-16 12:32:41 UTC
Now that a patch for PR51119 is in, we can think about
inserting processor-specific versions.

target_clones looks like a good idea for this; see

https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Common-Function-Attributes.html

There are still a few issues to be resolved, for example which
architectures to choose. Also, selecting an architecture which does
not exist on the platform leads to errors, so we probably need
to guard with appropriate #ifdefs.

A wrapper function to call the actual matmul is probably a good idea,
because it is the caller who generates the code to select.
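
[Editor's note: a minimal sketch of what target_clones dispatch could look like; the kernel name and naive loop are placeholders for illustration, not the libgfortran code.]

/* GCC emits an AVX clone, a default clone, and an ifunc resolver that
   selects one based on the executing CPU.  */
__attribute__((target_clones("avx","default")))
void
matmul_kernel (double *c, const double *a, const double *b, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      {
        double s = 0.0;
        for (int k = 0; k < n; k++)
          s += a[i * n + k] * b[k * n + j];
        c[i * n + j] = s;
      }
}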
Comment 1 Thomas Koenig 2016-11-17 17:52:27 UTC
Created attachment 40074
Test program for benchmarks
Comment 2 Thomas Koenig 2016-11-17 17:53:10 UTC
Here are some measurements with the AVX-enabling patch.
They were done on an AVX machine, namely gcc75 from the compile farm.

This was done with the command line

gfortran -static-libgfortran -finline-matmul-limit=0 -Ofast -o compare_mavx compare_2.f90

Unconditionally setting -mavx in the Makefile for matmul, with stock trunk:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.067      0.077      0.051      0.069
    3  5000      0.193      0.218      0.157      0.194
    4  5000      0.429      0.423      0.368      0.435
    5  5000      0.609      0.659      0.556      0.630
    7  5000      0.948      1.018      0.931      1.009
    8  5000      1.608      1.251      1.589      1.715
    9  5000      1.755      1.484      1.745      1.856
   15  5000      2.710      2.175      2.963      3.105
   16  5000      4.289      2.510      4.541      4.784
   17  5000      4.411      3.032      4.675      4.888
   31  5000      6.165      4.395      6.912      6.902
   32  5000      8.800      4.362      8.793      8.809
   33  5000      8.156      4.463      8.145      8.193
   63  5000      9.727      4.364      9.709      9.716
   64  5000     11.828      4.023     11.810     11.798
   65  5000     10.726      4.489     10.654     10.725
  127  3920     12.144      4.292     12.281     12.268
  128  3829     13.829      4.484     13.807     13.841
  129  3741     12.986      4.438     12.964     12.985
  255   483     14.446      4.571     14.462     14.442
  256   477     15.738      4.707     15.744     15.738
  257   472     13.981      4.565     13.995     13.990
  511    60     14.954      4.674     14.977     14.933
  512    59     16.120      4.840     16.137     16.062
  513    59     14.488      4.392     14.497     14.490
 1023     7     15.011      3.573     15.021     14.995
 1024     7     15.938      3.489     15.947     15.938
 1025     7     14.670      3.568     14.683     14.627

With library-side switching (https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01810.html):

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.067      0.080      0.053      0.067
    3  5000      0.192      0.226      0.159      0.192
    4  5000      0.427      0.436      0.364      0.431
    5  5000      0.588      0.664      0.543      0.621
    7  5000      0.938      0.914      0.926      1.011
    8  5000      1.589      1.235      1.558      1.671
    9  5000      1.704      1.486      1.694      1.810
   15  5000      2.638      2.175      2.854      3.031
   16  5000      4.234      2.532      4.533      4.745
   17  5000      4.374      3.044      4.677      4.839
   31  5000      6.207      4.401      6.891      6.918
   32  5000      8.824      4.364      8.614      8.603
   33  5000      7.954      4.349      7.945      7.944
   63  5000      8.802      4.369      9.728      9.764
   64  5000     11.845      4.025     11.783     11.849
   65  5000     10.753      4.595     10.719     10.753
  127  3920     12.023      4.314     12.285     12.004
  128  3829     13.427      4.369     13.722     13.742
  129  3741     12.877      4.323     12.668     12.985
  255   483     14.398      4.453     14.336     13.496
  256   477     15.708      4.680     15.711     15.465
  257   472     13.977      4.439     13.965     13.977
  511    60     14.920      4.691     14.937     14.939
  512    59     15.959      4.787     16.084     16.082
  513    59     14.444      4.636     14.464     14.452
 1023     7     14.978      3.448     14.979     14.980
 1024     7     15.903      3.640     15.900     15.905
 1025     7     14.638      3.464     14.626     14.636

With stock trunk:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.072      0.078      0.053      0.072
    3  5000      0.199      0.224      0.165      0.200
    4  5000      0.458      0.403      0.387      0.462
    5  5000      0.629      0.661      0.563      0.651
    7  5000      1.073      1.010      1.029      1.131
    8  5000      1.671      1.234      1.637      1.760
    9  5000      1.732      1.465      1.720      1.829
   15  5000      2.895      2.152      3.195      3.349
   16  5000      3.870      2.483      4.168      4.318
   17  5000      3.976      3.029      4.253      4.424
   31  5000      6.210      4.403      6.861      6.868
   32  5000      7.551      4.293      7.544      7.509
   33  5000      7.119      4.418      7.094      7.090
   63  5000      8.742      4.377      8.753      8.728
   64  5000      9.415      4.019      9.384      9.260
   65  5000      8.882      4.540      8.842      8.856
  127  3920     10.073      4.432      9.966      9.988
  128  3829     10.556      4.469     10.552     10.405
  129  3741      9.923      4.428      9.990      9.930
  255   483     10.827      4.569     10.875     10.768
  256   477     11.328      4.705     11.281     11.129
  257   472     10.402      4.492     10.344     10.360
  511    60     10.947      4.674     11.003     10.938
  512    59     11.503      4.842     11.504     11.314
  513    59     10.654      4.672     10.651     10.619
 1023     7     10.941      3.641     10.944     10.863
 1024     7     11.370      3.587     11.261     11.193
 1025     7     10.734      3.601     10.652     10.704

With inlined matmul, -Ofast without -mavx:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      8.979      0.078      0.154      0.241
    3  5000     14.042      0.224      0.348      0.451
    4  5000      1.686      0.435      0.500      0.707
    5  5000      1.989      0.617      0.577      0.829
    7  5000      2.163      0.846      0.783      1.123
    8  5000      3.742      1.224      0.879      1.322
    9  5000      2.764      1.420      0.996      1.458
   15  5000      3.461      2.108      1.305      2.420
   16  5000      4.395      2.589      1.619      2.901
   17  5000      5.238      3.291      1.934      3.579
   31  5000      7.207      4.434      2.347      4.385
   32  5000      7.318      4.306      2.351      4.329
   33  5000      7.204      4.466      2.052      4.421
   63  5000      4.688      4.365      2.486      4.700
   64  5000      4.246      4.022      2.480      4.664
   65  5000      4.238      4.355      2.486      4.703
  127  3920      4.411      4.427      2.821      4.340
  128  3829      4.365      4.481      2.846      4.434
  129  3741      4.427      4.441      2.828      4.396
  255   483      4.561      4.569      2.972      4.517
  256   477      4.666      4.701      2.905      4.685
  257   472      4.520      4.573      2.974      4.550
  511    60      4.669      4.675      3.075      4.666
  512    59      4.823      4.843      3.095      4.835
  513    59      4.655      4.672      3.077      4.651
 1023     7      3.555      3.563      2.718      3.554
 1024     7      3.519      3.529      2.713      3.519
 1025     7      3.527      3.543      2.715      3.536

With the inlined version with -mavx:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      8.990      0.074      0.155      0.206
    3  5000      7.488      0.212      0.304      0.396
    4  5000      1.773      0.342      0.501      0.533
    5  5000      2.000      0.552      0.615      0.739
    7  5000      2.163      0.919      0.807      1.057
    8  5000      3.369      1.388      0.905      1.578
    9  5000      2.694      1.347      1.020      1.492
   15  5000      3.441      2.201      1.325      2.631
   16  5000      1.831      3.399      1.677      4.137
   17  5000      4.554      3.461      1.976      4.120
   31  5000      7.111      5.286      2.372      5.712
   32  5000      8.384      5.887      2.040      6.725
   33  5000      7.218      5.374      2.057      5.798
   63  5000      8.131      6.107      2.477      6.418
   64  5000      8.707      6.518      2.313      7.228
   65  5000      7.768      6.003      2.427      4.503
  127  3920      6.714      5.688      2.761      6.293
  128  3829      7.067      6.688      2.777      6.880
  129  3741      6.277      6.023      2.765      6.296
  255   483      6.036      5.681      2.877      5.765
  256   477      6.177      5.869      2.921      5.917
  257   472      6.017      5.687      2.880      5.766
  511    60      6.156      5.878      2.848      5.920
  512    59      6.338      6.107      3.026      6.092
  513    59      6.125      5.826      2.954      5.817
 1023     7      4.130      4.111      2.623      4.104
 1024     7      4.270      4.219      2.667      4.198
 1025     7      4.206      4.159      2.616      4.149
Comment 3 Jerry DeLisle 2016-11-17 19:57:21 UTC
I did apply your second patch.

I do not get any improvement, and results are diminished from current trunk, so I am missing something. This is the same machine I used for the results in PR 51119. It does have AVX.

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

$ gfc -static-libgfortran -finline-matmul-limit=0 -Ofast -o compare_mavx compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      5.043      0.045      0.091      0.150
    4  2000      1.417      0.235      0.353      0.325
    8  2000      2.016      0.634      0.862      2.021
   16  2000      5.332      2.834      2.239      2.929
   32  2000      6.169      3.496      1.931      3.289
   64  2000      2.656      2.836      2.655      2.657
  128  2000      2.898      3.286      2.901      2.901
  256   477      3.157      3.429      3.156      3.157
  512    59      3.082      2.356      3.133      3.126
 1024     7      3.102      1.363      3.144      3.136
 2048     1      3.099      1.685      3.144      3.140
Comment 4 Uroš Bizjak 2016-11-17 20:14:12 UTC
(In reply to Jerry DeLisle from comment #3)
> I did apply your second patch.
> 
> I do not get any improvement, and results are diminished from current trunk,
> so I am missing something. This is the same machine I used for the results
> in PR 51119. It does have AVX.

You have an AMD processor; can you try the -mprefer-avx128 option?
Comment 5 Jerry DeLisle 2016-11-17 22:56:57 UTC
(In reply to Uroš Bizjak from comment #4)
> (In reply to Jerry DeLisle from comment #3)
> > I did apply your second patch.
> > 
> > I do not get any improvement, and results are diminished from current
> > trunk, so I am missing something. This is the same machine I used for
> > the results in PR 51119. It does have AVX.
> 
> You have an AMD processor; can you try the -mprefer-avx128 option?

You may notice I was invoking the wrong executable in what I posted in comment #3. I did rerun the correct one several times and tried it with -mavx -mprefer-avx128. I get the same poor results regardless.
Comment 6 Thomas Koenig 2016-11-17 23:21:36 UTC
> You may notice I was invoking the wrong executable in what I posted in
> comment #3. I did rerun the correct one several times and tried it with
> -mavx -mprefer-avx128. I get the same poor results regardless.

Several things could go wrong here...

If you run the benchmark under gdb and break, then type
"disassemble $pc,$pc+200", do you actually end up in the right
program part (the one with AVX instructions)?

Or does your machine prefer AVX128?

To find out, what are the timings for inline code using

-mavx -Ofast

-mavx -mprefer-avx128 -Ofast

?
Comment 7 Thomas Koenig 2016-11-17 23:25:08 UTC
And one more thing.

Comparing the timing you get for the version with the target_clone
and a version using just -mavx added to the relevant line in the
Makefile, do you see a difference?
Comment 8 Jerry DeLisle 2016-11-18 03:20:32 UTC
(In reply to Thomas Koenig from comment #6)
> > You may notice I was invoking the wrong executable in what I posted in
> > comment #3. I did rerun the correct one several times and tried it with
> > -mavx -mprefer-avx128. I get the same poor results regardless.
> 
> Several things could go wrong here...
> 
> If you run the benchmark under gdb and break, then type
> "disassemble $pc,$pc+200", do you actually end up in the right
> program part (the one with AVX instructions)?

452				      f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
(gdb) disassemble $pc,$pc+200
Dump of assembler code from 0x7ffff7af3554 to 0x7ffff7af361c:
=> 0x00007ffff7af3554 <aux_matmul_r8+5220>:	vaddpd %ymm12,%ymm4,%ymm4
   0x00007ffff7af3559 <aux_matmul_r8+5225>:	vmulpd %ymm10,%ymm15,%ymm12
   0x00007ffff7af355e <aux_matmul_r8+5230>:	vaddpd %ymm11,%ymm5,%ymm5
   0x00007ffff7af3563 <aux_matmul_r8+5235>:	vmulpd %ymm14,%ymm15,%ymm15
   0x00007ffff7af3568 <aux_matmul_r8+5240>:	vmulpd %ymm10,%ymm13,%ymm10
   0x00007ffff7af356d <aux_matmul_r8+5245>:	vaddpd %ymm12,%ymm6,%ymm6
   0x00007ffff7af3572 <aux_matmul_r8+5250>:	vmulpd %ymm14,%ymm13,%ymm14
   0x00007ffff7af3577 <aux_matmul_r8+5255>:	vaddpd %ymm15,%ymm8,%ymm8
   0x00007ffff7af357c <aux_matmul_r8+5260>:	vaddpd %ymm10,%ymm7,%ymm7
   0x00007ffff7af3581 <aux_matmul_r8+5265>:	vaddpd %ymm14,%ymm9,%ymm9
   0x00007ffff7af3586 <aux_matmul_r8+5270>:	ja     0x7ffff7af3433 <aux_matmul_r8+4931>
   0x00007ffff7af358c <aux_matmul_r8+5276>:	mov    -0x801f8(%rbp),%rdx
   0x00007ffff7af3593 <aux_matmul_r8+5283>:	vhaddpd %ymm9,%ymm9,%ymm13
   0x00007ffff7af3598 <aux_matmul_r8+5288>:	vhaddpd %ymm8,%ymm8,%ymm15
   0x00007ffff7af359d <aux_matmul_r8+5293>:	vhaddpd %ymm7,%ymm7,%ymm7
   0x00007ffff7af35a1 <aux_matmul_r8+5297>:	vperm2f128 $0x1,%ymm13,%ymm13,%ymm11
   0x00007ffff7af35a7 <aux_matmul_r8+5303>:	vhaddpd %ymm5,%ymm5,%ymm5
   0x00007ffff7af35ab <aux_matmul_r8+5307>:	vperm2f128 $0x1,%ymm15,%ymm15,%ymm8
   0x00007ffff7af35b1 <aux_matmul_r8+5313>:	vaddpd %ymm11,%ymm13,%ymm12
   0x00007ffff7af35b6 <aux_matmul_r8+5318>:	vperm2f128 $0x1,%ymm7,%ymm7,%ymm13
   0x00007ffff7af35bc <aux_matmul_r8+5324>:	vaddpd %ymm8,%ymm15,%ymm14
   0x00007ffff7af35c1 <aux_matmul_r8+5329>:	vhaddpd %ymm6,%ymm6,%ymm6
---Type <return> to continue, or q <return> to quit---
   0x00007ffff7af35c5 <aux_matmul_r8+5333>:	vaddsd -0x80068(%rbp),%xmm12,%xmm10
   0x00007ffff7af35cd <aux_matmul_r8+5341>:	vaddsd -0x80070(%rbp),%xmm14,%xmm9
   0x00007ffff7af35d5 <aux_matmul_r8+5349>:	vperm2f128 $0x1,%ymm5,%ymm5,%ymm14
   0x00007ffff7af35db <aux_matmul_r8+5355>:	vhaddpd %ymm4,%ymm4,%ymm4
   0x00007ffff7af35df <aux_matmul_r8+5359>:	vaddpd %ymm13,%ymm7,%ymm11
   0x00007ffff7af35e4 <aux_matmul_r8+5364>:	vmovsd %xmm10,-0x80068(%rbp)
   0x00007ffff7af35ec <aux_matmul_r8+5372>:	vperm2f128 $0x1,%ymm6,%ymm6,%ymm10
   0x00007ffff7af35f2 <aux_matmul_r8+5378>:	vperm2f128 $0x1,%ymm4,%ymm4,%ymm13
   0x00007ffff7af35f8 <aux_matmul_r8+5384>:	vmovsd %xmm9,-0x80070(%rbp)
   0x00007ffff7af3600 <aux_matmul_r8+5392>:	vaddpd %ymm14,%ymm5,%ymm9
   0x00007ffff7af3605 <aux_matmul_r8+5397>:	vhaddpd %ymm0,%ymm0,%ymm0
   0x00007ffff7af3609 <aux_matmul_r8+5401>:	vaddsd -0x80058(%rbp),%xmm11,%xmm12
   0x00007ffff7af3611 <aux_matmul_r8+5409>:	vaddpd %ymm10,%ymm6,%ymm15
   0x00007ffff7af3616 <aux_matmul_r8+5414>:	vaddpd %ymm13,%ymm4,%ymm11
   0x00007ffff7af361b <aux_matmul_r8+5419>:	vperm2f128 $0x1,%ymm0,%ymm0,%ymm13
End of assembler dump.



> 
> Or does your machine prefer AVX128?
> 
> To find out, what are the timings for inline code using
> 
> -mavx -Ofast
> 
> -mavx -mprefer-avx128 -Ofast
> 
> ?
$ gfc  -finline-matmul-limit=64 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      4.933      0.045      0.086      0.144
    4  2000      1.418      0.225      0.271      0.347
    8  2000      2.168      0.616      1.296      1.830
   16  2000      5.330      2.824      1.784      2.907
   32  2000      6.239      3.488      1.446      3.406
   64  2000      2.650      2.746      1.552      2.691

$ gfc  -finline-matmul-limit=64 -mavx -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      6.934      0.042      0.091      0.134
    4  2000      1.320      0.181      0.365      0.252
    8  2000      1.007      0.446      1.595      0.982
   16  2000      0.581      1.163      2.411      1.180
   32  2000      1.346      1.276      2.061      1.277
   64  2000      1.397      1.327      2.288      1.328

$ gfc  -finline-matmul-limit=64 -mavx -mprefer-avx128 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      5.021      0.045      0.088      0.139
    4  2000      1.607      0.202      0.288      0.341
    8  2000      2.482      0.575      0.743      1.861
   16  2000      5.674      2.804      1.809      2.792
   32  2000      6.323      3.460      1.478      3.293
   64  2000      2.714      2.832      1.582      2.694

If I put -mavx -mprefer-avx128 in the Makefile.am, I get results as good as or better than without your patch. I also see that none of the HAVE_AVX* macros are defined in config.h.

$ gfc  -finline-matmul-limit=0 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.043      0.041      0.034      0.043
    4  2000      0.272      0.234      0.223      0.256
    8  2000      0.835      1.687      1.627      1.709
   16  2000      2.886      2.887      2.859      2.869
   32  2000      4.733      3.494      4.755      4.652
   64  2000      6.933      2.837      6.933      6.877
  128  2000      7.949      3.285      8.705      7.914
  256   477     10.040      3.447      9.999      9.951
  512    59      8.885      2.341      8.923      8.940
 1024     7      8.937      1.367      8.978      8.991
 2048     1      8.799      1.672      8.831      8.854

The following is in config.h.in, for what it is worth:

/* Define if AVX instructions can be compiled. */
#undef HAVE_AVX

/* Define if AVX2 instructions can be compiled. */
#undef HAVE_AVX2

/* Define if AVX512f instructions can be compiled. */
#undef HAVE_AVX512F
Comment 9 Thomas Koenig 2016-11-18 16:38:36 UTC
Next question - what happens if you add

-mvzeroupper -mavx

to the line in the Makefile?  Does that make a difference in speed?
Comment 10 Jerry DeLisle 2016-11-18 17:26:57 UTC
(In reply to Thomas Koenig from comment #9)
> Next question - what happens if you add
> 
> -mvzeroupper -mavx
> 
> to the line in the Makefile?  Does that make a difference in speed?

-mvzeroupper slows everything way down, with or without -mprefer-avx128.
Comment 11 Jerry DeLisle 2016-11-18 17:36:51 UTC
One could consider running a reference matrix multiply of size 32 in a loop and doing timing tests to determine whether to use -mprefer-avx128. On this machine, from comment 8:

-mavx = 1.276     -mavx -mprefer-avx128 = 3.460

There is some margin there for a fairly good test. Or is there another way to tell?
Comment 12 Thomas Koenig 2016-11-19 10:33:29 UTC
(In reply to Jerry DeLisle from comment #11)
> One could consider running a reference matrix multiply of size 32 in a loop
> and do timing tests to determine whether to use -mprefer-avx128. 0n this
> machine from comment 8
> 
> mavx = 1.276     mavx mprefer-avx128 = 3.460
> 
> There is some margin there for a fairly good test. Or is there another way
> to tell?

I read some advice on the net that certain types of AMD processors
have AVX, but AVX128 is better for them.

What exactly is your CPU model?  What does /proc/cpuinfo say?

gcc determines the CPU model (see trunk/libgcc/config/i386/cpuinfo.c).
We should be able to query the CPU model and dispatch for AVX128
or AVX (or the other variants) based on that.
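
[Editor's note: for reference, the structure libgcc exposes there looks like the following; treat this as a sketch, the authoritative declaration and field semantics are in cpuinfo.c.]

/* Filled in by a constructor in libgcc before main() runs, so library
   code can simply read it (compare the gdb output in comment 16).  */
struct __processor_model
{
  unsigned int __cpu_vendor;
  unsigned int __cpu_type;
  unsigned int __cpu_subtype;
  unsigned int __cpu_features[1];
};
extern struct __processor_model __cpu_model;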
Comment 13 Thomas Koenig 2016-11-19 13:27:52 UTC
OK, I think I have a rough idea how to do this.

For querying the CPU model, we need to put the interface in
libgcc/config/i386/cpuinfo.c into a separate header.

Then we generate a list of matmul functions using m4, with
a second parameter, which gives us the architecture, such as in

$(M4) -Dfile=$@ -Darch=avx512f ...

In the generated C files, we enclose the whole content in #ifdef HAVE_AVX512F,
so nothing happens if the architecture is not supported by the compiler.
The target attribute is also set there.

On the first call to matmul, we check for the availability of AVX
etc.; we also check for preferences such as AVX128 from the CPU model,
and then set a static function pointer to the function we want to call.
On each subsequent invocation, all we do is that (tail) call.

How does this sound?
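
[Editor's note: in C, a simplified, self-contained sketch of this first-call dispatch scheme; names and flat-array signatures are illustrative, the real code is m4-generated and uses gfc_array descriptors.]

#include <stddef.h>

typedef void (*matmul_fn) (double *, const double *, const double *, size_t);

static void
matmul_vanilla (double *c, const double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      {
        double s = 0.0;
        for (size_t k = 0; k < n; k++)
          s += a[i * n + k] * b[k * n + j];
        c[i * n + j] = s;
      }
}

#ifdef HAVE_AVX
/* Identical source, but this one function is compiled as AVX code.  */
__attribute__((__target__("avx")))
static void
matmul_avx (double *c, const double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      {
        double s = 0.0;
        for (size_t k = 0; k < n; k++)
          s += a[i * n + k] * b[k * n + j];
        c[i * n + j] = s;
      }
}
#endif

void
matmul_r8 (double *c, const double *a, const double *b, size_t n)
{
  static matmul_fn fn = NULL;

  if (fn == NULL)     /* first call: select the kernel once */
    {
      fn = matmul_vanilla;
#ifdef HAVE_AVX
      if (__builtin_cpu_supports ("avx"))
        fn = matmul_avx;
#endif
    }
  fn (c, a, b, n);    /* every subsequent call is just this (tail) call */
}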
Comment 14 Jerry DeLisle 2016-11-19 17:24:28 UTC
(In reply to Thomas Koenig from comment #12)
> I read some advice on the net that certain types of AMD processors
> have AVX, but AVX128 is better for them.
> 
> What exactly is your CPU model?  What does /proc/cpuinfo say?
> 

I have three different machines here. I am sure they are all similar, as they are all A-series. The first is the one used for the test results posted here:

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 16
model name	: AMD A10-5800K APU with Radeon(tm) HD Graphics
stepping	: 1

2nd:

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 18
model		: 1
model name	: AMD A6-3620 APU with Radeon(tm) HD Graphics
stepping	: 0

3rd:

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 22
model		: 48
model name	: AMD A8-6410 APU with AMD Radeon R5 Graphics
stepping	: 1


(In reply to Thomas Koenig from comment #13)
> $(M4) -Dfile=$@ -Darch=avx512f ...
> 
> In the generated C files, we enclose the whole content in #ifdef HAVE_AVX512F,
> so nothing happens if the architecture is not supported by the compiler.
> The target attribute is also set there.
> 
> On the first call to matmul, we check for the availability of AVX
> etc.; we also check for preferences such as AVX128 from the CPU model,
> and then set a static function pointer to the function we want to call.
> On each subsequent invocation, all we do is that (tail) call.
> 
> How does this sound?

This seems a bit complicated. The machines I have do OK without the aux-matmul and without any machine-specific compilation beyond the current gcc defaults and the flags in the Makefile on current trunk. Can this be done without the check on the first call to matmul?
Comment 15 Thomas Koenig 2016-11-20 10:25:50 UTC
OMG, the world of processors is more complicated than I thought.
So, these rather modern AMD chips support AVX, but suck at it.

Two questions:

- Can you check if -mfma (FMA3) and/or -mfma4 make any difference?

- If you start any program compiled with -g under the debugger, break
  anywhere (for example at the beginning of the main program)
  and do a "p __cpu_model", what do you get?

I am halfway tempted to restrict the AVX* stuff to Intel processors
only.  At least, this way we will not make things worse for AMD
processors.
Comment 16 Jerry DeLisle 2016-11-20 22:06:57 UTC
(In reply to Thomas Koenig from comment #15)
> OMG, the world of processors is more complicated than I thought.
> So, these rather modern AMD chips support AVX, but suck at it.
> 
> Two questions:
> 
> - Can you check if -mfma (FMA3) and/or -mfma4 make any difference?
> 
> - If you start any program compiled with -g under the debugger, break
>   anywhere (for example at the beginning of the main program)
>   and do a "p __cpu_model", what do you get?

The A10-5800K:
p __cpu_model
$1 = {__cpu_vendor = 2, __cpu_type = 5, __cpu_subtype = 8, 
  __cpu_features = {883711}}


The A8:
p __cpu_model
$2 = {__cpu_vendor = 2, __cpu_type = 9, __cpu_subtype = 0, 
  __cpu_features = {855039}}


The A6:
p __cpu_model
$1 = {__cpu_vendor = 2, __cpu_type = 0, __cpu_subtype = 0, 
  __cpu_features = {2111}}

Neither -mfma nor -mfma4 helps.
Comment 17 Jerry DeLisle 2016-11-20 23:02:19 UTC
On a hunch, this brings it back.

$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4 -march=native

So -march=native fixes it; not quite as fast as -mprefer-avx128, but close enough.
Comment 18 Thomas Koenig 2016-11-22 17:01:02 UTC
Created attachment 40119
Version that works (AVX only)

Here is a version that should only do AVX stuff on Intel processors.
Optimization for other processor types could come later.
Comment 19 Thomas Koenig 2016-11-22 20:41:44 UTC
Created attachment 40120
Updated patch

Well, here's an update also for AVX512F.

I can confirm the patch gives the same performance as the AVX
version on a machine that supports AVX.  Untested on AVX512, because
I don't have a machine for that.

Adding AVX2 would be fairly trivial.

I'm not sure that yanking out the info into the new cpuinfo.h header
file is the way to go, but I am not sure of a better way to do it.

Other comments?
Comment 20 Jerry DeLisle 2016-11-22 20:53:41 UTC
(In reply to Thomas Koenig from comment #18)
> Created attachment 40119
> Version that works (AVX only)
> 
> Here is a version that should only do AVX stuff on Intel processors.
> Optimization for other processor types could come later.

This is interesting. This patch works fine on the AMD processors I tested.

Looking at the disassembly, the vanilla matmul does use the xmm registers, but no packed vector instructions. Peak with this is about 9.3 GFLOPS.

With -mavx and -mprefer-avx128, the peak is 10.0 GFLOPS, or about a 7.5% improvement.

I think we should get this patch committed, and then we can work on the AMD side. I know Steve is running an FX-series AMD processor; once this patch goes in, I will give it a spin there. The FX chips are clearly better than this generation of APU, which is more focused on the on-chip GPU features (which are pretty good).

We will also want to keep an eye on the Zen-based processors, which I expect will behave more like Intel regarding the vector instructions (well, we will see).
Comment 21 Jerry DeLisle 2016-11-22 20:56:07 UTC
(In reply to Thomas Koenig from comment #19)
> Created attachment 40120
> Updated patch
> 
> Well, here's an update also for AVX512F.
> 
> I can confirm the patch gives the same performance as the AVX
> version on a machine that supports AVX.  Untested on AVX512, because
> I don't have a machine for that.
> 
> Adding AVX2 would be fairly trivial.
> 
> I'm not sure that yanking out the info into the new cpuinfo.h header
> file is the way to go, but I am not sure of a better way to do it.
> 
> Other comments?

I wonder if there is one in the gcc compile farm. Is AVX512 a Knights Landing feature? Which machines have it? (Time to google.)
Comment 22 Thomas Koenig 2016-12-03 09:45:06 UTC
Author: tkoenig
Date: Sat Dec  3 09:44:35 2016
New Revision: 243219

URL: https://gcc.gnu.org/viewcvs?rev=243219&root=gcc&view=rev
Log:
2016-12-03  Thomas Koenig  <tkoenig@gcc.gnu.org>

        PR fortran/78379
        * config/i386/cpuinfo.c:  Move enums for processor vendors,
        processor type, processor subtypes and declaration of
        struct __processor_model into
        * config/i386/cpuinfo.h:  New header file.
        * Makefile.am:  Add dependence of m4/matmul_internal.m4 to
        matmul files.
        * Makefile.in:  Regenerated.
        * acinclude.m4:  Check for AVX, AVX2 and AVX512F.
        * config.h.in:  Add HAVE_AVX, HAVE_AVX2 and HAVE_AVX512F.
        * configure:  Regenerated.
        * configure.ac:  Use checks for AVX, AVX2 and AVX512F.
        * m4/matmul_internal.m4:  New file, the working part of matmul.m4.
        * m4/matmul.m4:  Implement architecture-specific switching
        for AVX, AVX2 and AVX512F by including matmul_internal.m4
        multiple times.
        * generated/matmul_c10.c: Regenerated.
        * generated/matmul_c16.c: Regenerated.
        * generated/matmul_c4.c: Regenerated.
        * generated/matmul_c8.c: Regenerated.
        * generated/matmul_i1.c: Regenerated.
        * generated/matmul_i16.c: Regenerated.
        * generated/matmul_i2.c: Regenerated.
        * generated/matmul_i4.c: Regenerated.
        * generated/matmul_i8.c: Regenerated.
        * generated/matmul_r10.c: Regenerated.
        * generated/matmul_r16.c: Regenerated.
        * generated/matmul_r4.c: Regenerated.
        * generated/matmul_r8.c: Regenerated.


Added:
    trunk/libgcc/config/i386/cpuinfo.h
    trunk/libgfortran/m4/matmul_internal.m4
Modified:
    trunk/libgcc/ChangeLog
    trunk/libgcc/config/i386/cpuinfo.c
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/acinclude.m4
    trunk/libgfortran/config.h.in
    trunk/libgfortran/configure
    trunk/libgfortran/configure.ac
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
Comment 23 Dominique d'Humieres 2016-12-03 15:07:55 UTC
Timings before r243219:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.020      0.059      0.140      0.181
    3  5000      0.475      0.551      0.411      0.531
    4  5000      1.011      1.120      0.951      1.131
    5  5000      1.446      1.512      1.286      1.490
    7  5000      2.481      2.323      2.313      2.573
    8  5000      3.511      2.496      3.402      3.678
    9  5000      3.575      2.300      2.074      2.694
   15  5000      4.395      3.242      5.172      5.299
   16  5000      5.907      3.228      5.920      6.009
   17  5000      5.445      3.804      4.681      5.489
   31  5000      7.133      4.291      7.209      7.304
   32  5000      7.984      4.323      7.197      7.580
   33  5000      6.739      4.488      7.306      7.377
   63  5000      8.718      4.682      8.997      9.170
   64  5000      9.667      4.555      9.611      9.882
   65  5000      9.263      4.462      9.018      9.418
  127  3920     10.378      4.287     10.327     10.296
  128  3829     10.960      4.353     10.967     11.138
  129  3741     10.343      4.315     10.065     10.440
  255   483     11.370      4.522     11.511     11.229
  256   477     11.589      4.538     11.841     11.307
  257   472     10.983      4.532     10.721     10.955
  511    60     11.341      4.476     10.970     11.399
  512    59     12.164      4.666     12.257     11.726
  513    59     11.044      4.575     11.141     10.582
 1023     7     11.059      3.900     11.374     11.313
 1024     7     12.030      3.908     11.773     11.275
 1025     7     10.912      3.933     10.598     11.072

At r243219:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.096      0.108      0.098      0.125
    3  5000      0.353      0.411      0.290      0.355
    4  5000      0.779      0.770      0.651      0.846
    5  5000      1.176      1.286      1.088      1.193
    7  5000      2.089      2.260      1.991      2.142
    8  5000      3.232      2.430      3.164      3.486
    9  5000      3.380      2.747      3.370      3.575
   15  5000      4.668      3.018      4.481      4.692
   16  5000      5.184      3.506      5.987      6.404
   17  5000      5.747      3.348      5.596      5.774
   31  5000      6.995      4.036      7.046      7.040
   32  5000      8.822      4.161      7.868      8.076
   33  5000      7.778      4.348      8.078      8.090
   63  5000      9.600      4.509      9.682      9.367
   64  5000     11.616      4.365     11.045     10.845
   65  5000     10.434      4.337     10.536     10.558
  127  3920     11.975      4.259     12.065     11.979
  128  3829     13.767      4.307     12.918     13.469
  129  3741     12.370      4.139     11.410     12.350
  255   483     13.292      4.462     14.016     14.005
  256   477     14.298      4.477     14.312     15.027
  257   472     13.436      4.352     13.014     13.565
  511    60     13.484      4.574     14.024     13.789
  512    59     13.803      4.459     14.284     14.950
  513    59     13.094      4.479     13.069     13.234
 1023     7     13.952      3.914     14.194     13.873
 1024     7     14.636      3.837     14.675     14.987
 1025     7     13.649      3.953     13.594     13.701

For reference, with -fexternal-blas:

 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  5000      0.096      0.107      0.091      0.127
    3  5000      0.370      0.411      0.293      0.371
    4  5000      0.812      0.825      0.692      0.812
    5  5000      1.254      1.292      1.117      1.273
    7  5000      2.382      2.345      2.295      2.536
    8  5000      3.483      2.501      2.804      2.192
    9  5000      2.421      2.058      2.574      3.121
   15  5000      5.077      3.244      5.233      5.298
   16  5000      5.797      3.220      5.799      5.762
   17  5000      5.354      2.891      5.287      5.474
   31  5000      9.939      4.311     11.991     12.169
   32  5000     15.715      4.006     15.851     16.007
   33  5000     13.375      4.290     14.441     14.977
   63  5000     18.057      4.683     18.372     17.800
   64  5000     21.426      4.270     20.842     22.123
   65  5000     18.861      4.385     20.410     19.707
  127  3920     21.448      4.288     20.904     21.320
  128  3829     44.731      4.312     44.129     40.524
  129  3741     36.300      4.109     38.858     36.359
  255   483     52.876      4.310     57.982     54.261
  512    59     59.823      4.688     66.297     60.748
  513    59     58.666      4.559     60.481     57.547
 1023     7     61.315      3.900     64.559     61.124
 1024     7     63.148      3.861     68.033     62.486
 1025     7     58.991      3.895     55.074     58.168
Comment 24 Thomas Koenig 2017-02-27 13:54:45 UTC
Could be a good idea to add a version with -mfma to the flags for AVX2.

I'll see what I can do. It might be too late for gcc 7, and I also
don't have an AVX2 machine to test on.

Might also be a good idea to include this for AVX512F (if it is
automatically included).
Comment 25 Thomas Koenig 2017-03-02 11:04:32 UTC
Author: tkoenig
Date: Thu Mar  2 11:04:01 2017
New Revision: 245836

URL: https://gcc.gnu.org/viewcvs?rev=245836&root=gcc&view=rev
Log:
2017-03-02  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/78379
	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
	reals.  Add fma to target options.
	(matmul_'rtype_code`):  Call AVX2 only if FMA is available.
        * generated/matmul_c10.c: Regenerated.
        * generated/matmul_c16.c: Regenerated.
        * generated/matmul_c4.c: Regenerated.
        * generated/matmul_c8.c: Regenerated.
        * generated/matmul_i1.c: Regenerated.
        * generated/matmul_i16.c: Regenerated.
        * generated/matmul_i2.c: Regenerated.
        * generated/matmul_i4.c: Regenerated.
        * generated/matmul_i8.c: Regenerated.
        * generated/matmul_r10.c: Regenerated.
        * generated/matmul_r16.c: Regenerated.
        * generated/matmul_r4.c: Regenerated.
        * generated/matmul_r8.c: Regenerated.


Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
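
[Editor's note: boiled down to plain C, the m4 change in this commit amounts to the following; names and the placeholder body are illustrative, not the generated code.]

#include <stddef.h>

/* The AVX2 kernel is now compiled with FMA enabled as well; with
   "avx2,fma" in effect, GCC can contract the multiply-add below into
   vfmadd instructions.  */
__attribute__((__target__("avx2,fma")))
static void
matmul_r8_avx2 (double *c, const double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n * n; i++)
    c[i] += a[i] * b[i];
}

static void
matmul_r8_generic (double *c, const double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n * n; i++)
    c[i] += a[i] * b[i];
}

void
matmul_r8_dispatch (double *c, const double *a, const double *b, size_t n)
{
  /* "Call AVX2 only if FMA is available": both features must be present.  */
  if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    matmul_r8_avx2 (c, a, b, n);
  else
    matmul_r8_generic (c, a, b, n);
}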
Comment 26 David Edelsohn 2017-03-02 14:48:01 UTC
What is AVX-specific, as opposed to SIMD vector size-specific, about this feature? It seems that this should be enabled for all SIMD architectures of the appropriate width.
Comment 27 Thomas Koenig 2017-03-03 19:41:40 UTC
(In reply to David Edelsohn from comment #26)
> What is AVX-specific, as opposed to SIMD vector size-specific, about this
> feature? It seems that this should be enabled for all SIMD architectures of
> the appropriate width.

You're right; this might as well apply to other targets where
SIMD instructions are available only on some processors, but
cannot be turned on by default because they are not universally
implemented.

I would need three pieces of information:

- What to put into the libgfortran config file to check if
  the installed binutils support the SIMD extension in question

- How to check at runtime for the specific processor version

- Which options to pass to __attribute__((__target__ ..

Then it is relatively straightforward to put this in.
Comment 28 David Edelsohn 2017-03-03 19:51:31 UTC
Because PPC64LE Linux reset the base ISA level, VSX now is enabled by default, so a function clone for VSX probably isn't necessary.  While special versions might help AIX and PPC64BE, with lower ISA defaults, those are not the focus.
Comment 29 Thomas Koenig 2017-03-03 22:31:51 UTC
(In reply to David Edelsohn from comment #28)
> Because PPC64LE Linux reset the base ISA level, VSX now is enabled by
> default, so a function clone for VSX probably isn't necessary.  While
> special versions might help AIX and PPC64BE, with lower ISA defaults, those
> are not the focus.

What about ARM NEON?  Is that part of the normal ISA level?
Comment 30 Thomas Koenig 2017-05-07 10:41:55 UTC
I think there still is one thing to do.

Apparently, AMD CPUs (which currently use only the vanilla
version) are slightly faster with -mprefer-avx128,
and they should be much faster if they have FMA3.

Unless I missed something, it is not possible to
specify something like -mprefer-avx128 as a target
attribute.

What would be the best way to go about this?
Comment 31 Thomas Koenig 2017-05-22 16:44:50 UTC
Created attachment 41405
Patch for AMD

Here's a proposed patch for AMD. It uses AVX128 plus FMA3 when
both are available, or AVX128 plus FMA4, or nothing.

The rationale is that AVX128 alone does not do a lot for
AMD processors.

The new files will come as a separate attachment.
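
[Editor's note: in outline, the selection this patch adds looks like the following hedged sketch; the real patch dispatches to the generated matmulavx128_* variants, and this check is illustrative.]

/* On AMD, prefer the 128-bit AVX kernels, and only when FMA3 or FMA4
   is present to make them worthwhile.  */
int
matmul_use_avx128 (void)
{
  return __builtin_cpu_is ("amd")
         && __builtin_cpu_supports ("avx")
         && (__builtin_cpu_supports ("fma")        /* FMA3 */
             || __builtin_cpu_supports ("fma4"));
}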
Comment 32 Thomas Koenig 2017-05-22 16:46:11 UTC
Created attachment 41406
Additional files for the previous patch

Here are the new files for the patch.
Comment 33 Jerry DeLisle 2017-05-24 00:50:23 UTC
(In reply to Thomas Koenig from comment #32)
> Created attachment 41406
> Additional files for the previous patch
> 
> Here are the new files for the patch.

Well, I tried to apply the patch and test without using maintainer mode.

Running my tests in the debugger, breaking and disassembling shows xmm instructions and calls to matmul_vanilla, so I think I need to enable maintainer mode and rebuild, or something is not quite right.

Suggestions?
Comment 34 Thomas Koenig 2017-05-24 06:13:18 UTC
Created attachment 41410
Patch which has all the files

Well, I suspect my way of splitting the previous patch into
one real patch and one *.tar.gz file was not really the best way
to go :-)

Here is a patch which should include all the new files.

At least it fits within the 1000 kB attachment limit.
Comment 35 Jerry DeLisle 2017-05-24 14:15:30 UTC
(In reply to Thomas Koenig from comment #34)
> Created attachment 41410
> Patch which has all the files
> 
> Well, I suspect my way of splitting the previous patch into
> one real patch and one *.tar.gz file was not really the best way
> to go :-)
> 
> Here is a patch which should include all the new files.
> 
> At least it fits within the 1000 kB attachment limit.

I am finishing a build in maintainer mode, so I will try the first approach and, if that fails, the new patch. Everything looks reasonable; I just think we should test on my AMD boxes.
Comment 36 Jerry DeLisle 2017-05-24 15:08:03 UTC
Results look very good.

gfortran 7 without the patch gives:

$ gfc7 -static -Ofast -ftree-vectorize compare.f90 
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      4.706      0.046      0.094      0.162
    4  2000      1.246      0.246      0.305      0.351
    8  2000      1.410      0.605      0.958      1.791
   16  2000      5.413      2.787      2.228      2.615
   32  2000      4.676      3.416      4.622      4.618
   64  2000      6.368      2.652      6.339      6.167
  128  2000      8.165      2.998      8.118      8.260
  256   477      9.334      3.202      9.248      9.355
  512    59      8.730      2.239      8.596      8.730
 1024     7      8.805      1.378      8.673      8.812
 2048     1      8.781      1.728      8.649      8.789

Latest gfortran trunk with patch gives:

$ gfc -static -Ofast -ftree-vectorize compare.f90 
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      4.738      0.048      0.092      0.172
    4  2000      1.438      0.248      0.305      0.378
    8  2000      1.511      0.617      1.177      1.955
   16  2000      5.426      2.810      1.854      2.881
   32  2000      4.688      3.314      4.357      5.091
   64  2000      6.669      2.674      6.629      7.110
  128  2000      9.139      3.000      9.076      9.131
  256   477     10.495      3.184     10.466     10.516
  512    59      9.577      2.189      9.477      9.635
 1024     7      9.593      1.381      9.519      9.658
 2048     1      9.722      1.709      9.625      9.785
Comment 37 Thomas Koenig 2017-05-25 21:51:59 UTC
Author: tkoenig
Date: Thu May 25 21:51:27 2017
New Revision: 248472

URL: https://gcc.gnu.org/viewcvs?rev=248472&root=gcc&view=rev
Log:
2017-05-25  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR libfortran/78379
	* Makefile.am: Add generated/matmulavx128_*.c files.
	Handle them for compiling and setting the right flags.
	* acinclude.m4: Add tests for FMA3, FMA4 and AVX128.
	* configure.ac: Call them.
	* Makefile.in: Regenerated.
	* config.h.in: Regenerated.
	* configure: Regenerated.
	* m4/matmul.m4:  Handle AMD chips by calling 128-bit AVX
	versions which use FMA3 or FMA4.
	* m4/matmulavx128.m4: New file.
        * generated/matmul_c10.c: Regenerated.
        * generated/matmul_c16.c: Regenerated.
        * generated/matmul_c4.c: Regenerated.
        * generated/matmul_c8.c: Regenerated.
        * generated/matmul_i1.c: Regenerated.
        * generated/matmul_i16.c: Regenerated.
        * generated/matmul_i2.c: Regenerated.
        * generated/matmul_i4.c: Regenerated.
        * generated/matmul_i8.c: Regenerated.
        * generated/matmul_r10.c: Regenerated.
        * generated/matmul_r16.c: Regenerated.
        * generated/matmul_r4.c: Regenerated.
        * generated/matmul_r8.c: Regenerated.
        * generated/matmulavx128_c10.c: New file.
        * generated/matmulavx128_c16.c: New file.
        * generated/matmulavx128_c4.c: New file.
        * generated/matmulavx128_c8.c: New file.
        * generated/matmulavx128_i1.c: New file.
        * generated/matmulavx128_i16.c: New file.
        * generated/matmulavx128_i2.c: New file.
        * generated/matmulavx128_i4.c: New file.
        * generated/matmulavx128_i8.c: New file.
        * generated/matmulavx128_r10.c: New file.
        * generated/matmulavx128_r16.c: New file.
        * generated/matmulavx128_r4.c: New file.
        * generated/matmulavx128_r8.c: New file.


Added:
    trunk/libgfortran/generated/matmulavx128_c10.c
    trunk/libgfortran/generated/matmulavx128_c16.c
    trunk/libgfortran/generated/matmulavx128_c4.c
    trunk/libgfortran/generated/matmulavx128_c8.c
    trunk/libgfortran/generated/matmulavx128_i1.c
    trunk/libgfortran/generated/matmulavx128_i16.c
    trunk/libgfortran/generated/matmulavx128_i2.c
    trunk/libgfortran/generated/matmulavx128_i4.c
    trunk/libgfortran/generated/matmulavx128_i8.c
    trunk/libgfortran/generated/matmulavx128_r10.c
    trunk/libgfortran/generated/matmulavx128_r16.c
    trunk/libgfortran/generated/matmulavx128_r4.c
    trunk/libgfortran/generated/matmulavx128_r8.c
    trunk/libgfortran/m4/matmulavx128.m4
Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/acinclude.m4
    trunk/libgfortran/config.h.in
    trunk/libgfortran/configure
    trunk/libgfortran/configure.ac
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
Comment 38 Thomas Koenig 2017-05-26 05:20:01 UTC
This works for Intel and AMD now.

If anybody wants another architecture, we know how to do it.

Closing.