Bug 81616 - Update -mtune=generic for the current Intel and AMD processors
Summary: Update -mtune=generic for the current Intel and AMD processors
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jan Hubicka
URL:
Keywords:
Depends on:
Blocks: 80820
Reported: 2017-07-30 15:16 UTC by H.J. Lu
Modified: 2022-12-14 07:37 UTC
CC List: 6 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-07-31 00:00:00


Attachments
Test program used for the attached performance results (matrix.c) (300 bytes, text/x-csrc)
2017-11-23 06:41 UTC, Andrew Roberts
Test results for Ryzen system with matrix.c (2.72 KB, text/plain)
2017-11-23 06:43 UTC, Andrew Roberts
Test results for Haswell system with matrix.c (1.44 KB, text/plain)
2017-11-23 06:43 UTC, Andrew Roberts
Test results for Skylake system with matrix.c (1.21 KB, text/plain)
2017-11-23 06:44 UTC, Andrew Roberts
Script for matrix.c test program (456 bytes, application/x-csh)
2017-11-23 07:21 UTC, Andrew Roberts
modified mt19937ar test program, test script and results (25.88 KB, application/gzip)
2017-11-28 07:40 UTC, Andrew Roberts
Untested fix for harmful FMAs (6.30 KB, patch)
2017-12-13 14:28 UTC, Martin Jambor

Description H.J. Lu 2017-07-30 15:16:02 UTC
-mtune=generic should be updated for the current Intel and AMD
processors.
Comment 1 H.J. Lu 2017-07-30 15:25:15 UTC
*** Bug 81614 has been marked as a duplicate of this bug. ***
Comment 2 Richard Biener 2017-07-31 07:14:12 UTC
Confirmed.  Honza is working on this.
Comment 3 Jan Hubicka 2017-11-19 14:12:19 UTC
I am mostly done with my tuning overhaul for core+ and znver and
I plan to work on generic now in early stage3.  My rough plan is:
 - drop flags that are there for the benefit of anything earlier than Core2 and Bulldozer
 - base instruction costs on Haswell (and later) + Znver1 latencies, keeping Bulldozers in mind
 - revisit code alignment strategies. It seems to me that by default we align way too much for both Core and Zen. Maybe code alignment does not pay back at all for -O2 and should be done only at -Ofast or so.
 - switch instruction scheduling to a more modern chip (currently we schedule for K8).  Here I need to figure out how much Core-based chips care about the particular scheduler model, but I suspect both Core and Zen are quite neutral here and mostly benefit from basic scheduling for latencies.
 - figure out the best vectorization model - here AVX may be fun, because Core and Znver prefer different kinds of codegen.

Ideas are welcome.
Comment 4 Andrew Roberts 2017-11-23 06:37:16 UTC
I've been testing on a Ryzen system and also comparing with Haswell and Skylake. From my testing, -mtune=znver1 does not perform well and never has, including with the latest snapshot:
gcc version 8.0.0 20171119 (experimental) (GCC)

-mtune=generic seems a better option for all three systems as a default for -march=native

This is only with one test case (attached), but I've seen the same across many other tests.

See the attached testcase (matrix.c) and performance logs:
Ryzen - znver1-tunebug.txt
Haswell - znver1-tunebug2.txt
Skylake - znver1-tunebug3.txt
Comment 6 Andrew Roberts 2017-11-23 06:41:58 UTC
Created attachment 42687 [details]
Test program used for the attached performance results (matrix.c)

Test program used for the attached performance results (matrix.c)
Comment 7 Andrew Roberts 2017-11-23 06:43:03 UTC
Created attachment 42688 [details]
Test results for Ryzen system with matrix.c

Test results for Ryzen system with matrix.c
Comment 8 Andrew Roberts 2017-11-23 06:43:41 UTC
Created attachment 42689 [details]
Test results for Haswell system with matrix.c

Test results for Haswell system with matrix.c
Comment 9 Andrew Roberts 2017-11-23 06:44:27 UTC
Created attachment 42690 [details]
Test results for Skylake system with matrix.c

Test results for Skylake system with matrix.c
Comment 10 Andrew Roberts 2017-11-23 07:21:13 UTC
Created attachment 42691 [details]
Script for matrix.c test program

Script for matrix.c test program
Comment 11 Jakub Jelinek 2017-11-23 07:54:45 UTC
I've also been wondering whether the ISA selection shouldn't affect -mtune=generic tuning: say, in TUs (or even just functions) that have AVX512* enabled, shouldn't the generic tuning be taken just from the set of CPUs that currently support that ISA?  Of course that would change once some AMD chips start supporting it.
Comment 12 Andrew Roberts 2017-11-27 05:51:20 UTC
OK, I've tried again with this week's snapshot:

gcc version 8.0.0 20171126 (experimental) (GCC) 

Taking a combination of -march and -mtune that works well on Ryzen:

/usr/local/gcc/bin/gcc -march=core-avx-i -mtune=nocona -O3 matrix.c -o matrix
./matrix
mult took     131153 clocks

Then switching to -mtune=znver1

/usr/local/gcc/bin/gcc -march=core-avx-i -mtune=znver1 -O3 matrix.c -o matrix
./matrix
 mult took     231309 clocks

Then, looking at the differences in the -Q --help=target output for these two and eliminating each difference one at a time, I found that:

gcc -march=core-avx-i -mtune=znver1 -mprefer-vector-width=none -O3 matrix.c -o matrix
[aroberts@ryzen share]$ ./matrix
mult took     132295 clocks

The default for znver1 is: -mprefer-vector-width=128

So is this option still helping with the latest microcode? Not in this case at least.
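
A minimal sketch of that comparison workflow (output file names are illustrative, not from the original report):

gcc -Q --help=target -march=core-avx-i -mtune=nocona -O3 > nocona-target.txt
gcc -Q --help=target -march=core-avx-i -mtune=znver1 -O3 > znver1-target.txt
diff nocona-target.txt znver1-target.txt

Each differing option can then be overridden on the benchmark command line one at a time, as with -mprefer-vector-width=none above.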

cat /proc/cpuinfo : 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD Ryzen 7 1700 Eight-Core Processor
stepping	: 1
microcode	: 0x8001129

with -march=znver1 -mtune=znver1
with default of -mprefer-vector-width=128
mult took     386291 clocks

with -march=znver1 -mtune=znver1 -mprefer-vector-width=none
mult took     201455 clocks
Comment 13 Jan Hubicka 2017-11-27 14:26:40 UTC
> So is this option still helping with the latest microcode? Not in this case at
> least.

It is on my TODO list to re-benchmark 256bit vectorization for Zen.  I do not
think microcode makes a big difference here.  Using 256 bit vectors has the
advantage of exposing more parallelism but also the disadvantage of requiring
a more involved setup.  So for loops that vectorize naturally (like matrix
multiplication) it can be a win, while for loops that are difficult to vectorize
it is a loss.  So I think the early benchmarks did not look consistent, and that
is why the 128bit mode was introduced.

It is not that different from vectorizing for K8, which had split SSE registers
in a similar fashion, or for kabylake which splits 512 bit operations.

While rewriting the cost model I tried to keep this in mind and more accurately
model the split operations, so it may be possible to switch to 256 by default.

Ideally the vectorizer should make a decision whether 128 or 256 is a win for a
particular loop, but it doesn't seem to have the infrastructure to do so.
My plan is to split the current flag into two - prefer 128bit, and assume
that registers are internally split - and see if that is enough to get a consistent
win for 256 bit vectorization.
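
A minimal sketch of how the two width preferences can be compared on a given test case (assuming the attached matrix.c; paths are illustrative):

gcc -Ofast -march=znver1 -mprefer-vector-width=128 matrix.c -o matrix-128
gcc -Ofast -march=znver1 -mprefer-vector-width=256 matrix.c -o matrix-256
./matrix-128; ./matrix-256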

Richi may know better.

Honza
Comment 14 Andrew Roberts 2017-11-27 15:17:13 UTC
It would be nice if znver1 for -march and -mtune could be improved before the gcc 8 release. At present -march=znver1 -mtune=znver1 looks to be about the worst thing you could do, and not just on this vectorizable code. And given we tell people to use -march=native, which gives this, it would be nice to improve.

With the attached example switching to larger vectors still only gets to 200000 clocks, whereas other combinations get down to 116045

mult took 116045 clocks -march=corei7-avx -mtune=skylake

So there is more going on here than just the vector length.

If there is any testing to isolate other options I would be happy to help, just point me in the right direction. If there are good (open) benchmarks I can routinely test on a range of targets I would be happy to. I have ryzen, haswell, skylake, arm, aarch64, etc.
Comment 15 Jan Hubicka 2017-11-27 15:29:58 UTC
> It would be nice if znver1 for -march and -mtune could be improved before the
> gcc 8 release. At present -march=znver1 -mtune=znver1 looks be to about the
> worst thing you could do, and not just on this vectorizable code. And given we
> tell people to use -march=native which gives this, it would be nice to improve.

We benchmarked znver1 tuning quite thoroughly with spec2000, spec2006 and 2017
and the situation is not that bad.
In August, with -O2, native tuning was about 0.3% better than generic (for both
int and fp; this does not include vectorization because of -O2, and keep in mind
that spec is often bound by memory, so a 0.3% difference is quite noticeable).
All regressions in individual benchmarks were under 2%, and some have been fixed since then.

For -Ofast the difference is about 0.5% for integer, with two notable regressions
which have WIP solutions.

Core tuning came out worse than generic on integer, so things were as intended.

I will quickly re-test 256bit vectorization with specfp2k (that is fast).
Please attach the regressing testcases you have and I will take a look, too.

Honza
Comment 16 Richard Biener 2017-11-27 15:35:57 UTC
(In reply to Jan Hubicka from comment #13)
> > So is this option still helping with the latest microcode? Not in this case
> > at least.
> 
> It is on my TODO list to re-benchmark 256bit vectorization for Zen.  I do not
> think microcode makes a big difference here.  Using 256 bit vectors has the
> advantage of exposing more parallelism but also the disadvantage of requiring
> a more involved setup.  So for loops that vectorize naturally (like matrix
> multiplication) it can be a win, while for loops that are difficult to
> vectorize it is a loss.  So I think the early benchmarks did not look
> consistent, and that is why the 128bit mode was introduced.
> 
> It is not that different from vectorizing for K8, which had split SSE
> registers in a similar fashion, or for kabylake which splits 512 bit
> operations.
> 
> While rewriting the cost model I tried to keep this in mind and more
> accurately model the split operations, so it may be possible to switch to
> 256 by default.
> 
> Ideally the vectorizer should make a decision whether 128 or 256 is a win for
> a particular loop, but it doesn't seem to have the infrastructure to do so.
> My plan is to split the current flag into two - prefer 128bit, and assume
> that registers are internally split - and see if that is enough to get a
> consistent win for 256 bit vectorization.
> 
> Richi may know better.

The vectorizer cannot currently evaluate both (or multiple) vector length
vectorization costs against each other.  Doing so with the current
implementation would have prohibitive cost (basically do the analysis
phase twice and if unlucky and the "first" wins, re-do analysis phase
of the winner).

Hmm, maybe not _too_ bad in the end...

But first and foremost costing is not aware of split AVX256 penalties,
so I'm not sure if doing the above would help.

I can cook up some "quick" prototype (maybe hidden behind a --param
paywall) so one could benchmark such mode.

Is there interest?

> Honza
Comment 17 Andrew Roberts 2017-11-27 15:56:29 UTC
The general consensus in userland is that the znver1 optimization is much worse than 0.5%, or even 2% off. Most people are using -march=haswell if they care about performance.

Just taking one part of one of my apps I see a 5% difference with -march=haswell vs -march=znver1, and this is just general code (loading GL extensions). 

The trick is to remove system dependencies from things I could benchmark. If there are no recommendations, I'll come up with some tests myself for various workloads, and try across various march/tune combos.

I'll also look at some other real world benchmarks that are available online.
Comment 18 Andrew Roberts 2017-11-28 07:35:01 UTC
OK, trying an entirely different algorithm, same results:

Using Mersenne Twister algorithm from here:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html

alter main program to comment out original test harness, and replace
main with:

int main(void)
{
    int i;
    unsigned long init[4]={0x123, 0x234, 0x345, 0x456}, length=4;
    init_by_array(init, length);
    clock_t e, s=clock();               /* clock() needs <time.h> */
    int j=genrand_int32();
    for(i=0; i<100000000; i++)
    {
      j ^= genrand_int32();             /* XOR-fold 100M outputs so the loop is not optimized away */
    }
    e=clock();
    if (j != -549769613) printf("Error j != -549769613 (%d)\n", j);  /* sanity check against a known value */
    printf("mt19937ar took %ld clocks ", (long)(e-s));
    return 0;
}

So nothing complicated.
On Ryzen:
--------

Top 5:
mt19937ar took 354877 clocks -march=amdfam10 -mtune=k8
mt19937ar took 356203 clocks -march=bdver2 -mtune=eden-x2
mt19937ar took 356534 clocks -march=nano-x2 -mtune=nano-1000
mt19937ar took 357321 clocks -march=athlon-fx -mtune=nano-x4
mt19937ar took 357634 clocks -march=bdver3 -mtune=nano-x2

Bot 5:
mt19937ar took 675052 clocks -march=nano -mtune=btver1
mt19937ar took 679826 clocks -march=k8 -mtune=nocona
mt19937ar took 681118 clocks -march=opteron -mtune=atom
mt19937ar took 689604 clocks -march=core2 -mtune=broadwell
mt19937ar took 699840 clocks -march=skylake -mtune=generic

Top -mtune=znver1
mt19937ar took 369722 clocks -march=nano-x2 -mtune=znver1

Top -march=znver1
mt19937ar took 375286 clocks -march=znver1 -mtune=silvermont

-march=znver1 -mtune=znver1 (aka native)
mt19937ar took 430875 clocks -march=znver1 -mtune=znver1

-march=haswell -mtune=haswell
mt19937ar took 402963 clocks -march=haswell -mtune=haswell

-march=k8 -mtune=k8
mt19937ar took 367890 clocks -march=k8 -mtune=k8

so -march=znver1 -mtune=znver1 is:
7% slower than tuning for haswell
17% slower than tuning for k8

Again -mtune=znver1, -mtune=bdverX, -mtune=btverX all cluster at the bottom

On Haswell:
----------

Top 5:
mt19937ar took 290000 clocks -march=amdfam10 -mtune=barcelona
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver1
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver2
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver3
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver4

Bot 5:
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver3
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver4
mt19937ar took 370000 clocks -march=znver1 -mtune=btver2
mt19937ar took 370000 clocks -march=znver1 -mtune=znver1
mt19937ar took 380000 clocks -march=knl -mtune=bdver1

Top -mtune=haswell
mt19937ar took 300000 clocks -march=bdver4 -mtune=haswell

Top -march=haswell
mt19937ar took 300000 clocks -march=haswell -mtune=broadwell

-march=haswell -mtune=haswell (aka native)
mt19937ar took 300000 clocks -march=haswell -mtune=haswell

Best performing pair:
mt19937ar took 290000 clocks -march=barcelona -mtune=barcelona

so the haswell options are pretty much optimal on that hardware, as seen in the other test.
Comment 19 Andrew Roberts 2017-11-28 07:40:28 UTC
Created attachment 42735 [details]
modified mt19937ar test program, test script and results

modified mt19937ar test program, test script and results

tar -tf mt19937ar-test.tar.gz
./doit.csh               <= Test script, change path to gcc!
./mt19937ar.c            <= main function altered to give test results
./mt19937ar-haswell.txt  <= full results on Intel Core i5-4570S
./mt19937ar-ryzen.txt    <= full results on AMD Ryzen 7 1700 Eight-Core Processor
Comment 20 Andrew Roberts 2017-11-28 07:45:27 UTC
Again those latest mt19937ar results above were with the current snapshot:

/usr/local/gcc/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc-8.0.0/libexec/gcc/x86_64-unknown-linux-gnu/8.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-8.0.0/configure --prefix=/usr/local/gcc-8.0.0 --program-suffix= --disable-werror --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --enable-gnu-indirect-function --with-isl --enable-languages=c,c++,fortran,lto --disable-libgcj --enable-lto --enable-multilib --with-tune=generic --with-arch_32=i686 --host=x86_64-unknown-linux-gnu --build=x86_64-unknown-linux-gnu --disable-bootstrap
Thread model: posix
gcc version 8.0.0 20171126 (experimental) (GCC)
Comment 21 Jan Hubicka 2017-11-28 08:13:58 UTC
Hi,
this is comparing SPEC2000 -Ofast -march=native -mprefer-vector-width=128
to -Ofast -march=native -mprefer-vector-width=256 on Ryzen.

   168.wupwise       1600    28.2    5669    *     1600    30.8    5187    *
   171.swim          3100    26.4    11763    *     3100    27.5    11261    *
   172.mgrid         1800    26.1    6907    *     1800    30.9    5827    *
   173.applu         2100    25.5    8234    *     2100    25.7    8161    *
   177.mesa          1400    23.4    5971    *     1400    23.2    6030    *
   178.galgel                                X                             X
   179.art           2600    10.9    23752    *     2600    10.9    23777    *
   183.equake        1300    12.9    10047    *     1300    12.9    10063    *
   187.facerec       1900    17.2    11025    *     1900    24.0    7921    *
   188.ammp          2200    34.2    6431    *     2200    34.4    6397    *
   189.lucas         2000    20.3    9859    *     2000    20.4    9807    *
   191.fma3d         2100    29.7    7061    *     2100    31.4    6694    *
   200.sixtrack      1100    38.8    2834    *     1100    41.5    2648    *
   301.apsi          2600    33.0    7873    *     2600    33.1    7856    *
   Est. SPECfp_base2000              8049
   Est. SPECfp2000                                                 7590

   164.gzip          1400    57.1    2450    *     1400    58.0    2413    *
   175.vpr           1400    37.4    3746    *     1400    37.5    3733    *
   176.gcc           1100    20.2    5450    *     1100    20.0    5489    *
   181.mcf           1800    21.7    8310    *     1800    21.4    8402    *
   186.crafty        1000    20.5    4874    *     1000    20.9    4794    *
   197.parser        1800    51.7    3481    *     1800    51.5    3498    *
   252.eon           1300    18.2    7154    *     1300    19.2    6759    *
   253.perlbmk                               X                             X
   254.gap                                   X                             X
   255.vortex                                X                             X
   256.bzip2         1500    42.6    3522    *     1500    42.9    3496    *
   300.twolf         3000    56.5    5313    *     3000    56.3    5330    *
   Est. SPECint_base2000             4612
   Est. SPECint2000                                                4575

So it does not seem to be a win in general.  I will compare with -mtune=haswell
now.
Comment 22 Jan Hubicka 2017-11-28 14:19:13 UTC
Hi,
this is the same base (so you can see there is some noise) compared to Haswell tuning:
   164.gzip          1400    57.1    2452    *     1400    58.7    2384    *
   175.vpr           1400    37.1    3776    *     1400    38.3    3659    *
   176.gcc           1100    20.0    5500    *     1100    20.1    5464    *
   181.mcf           1800    21.6    8327    *     1800    20.9    8617    *
   186.crafty        1000    20.4    4905    *     1000    21.0    4760    *
   197.parser        1800    51.3    3506    *     1800    51.9    3466    *
   252.eon           1300    18.2    7162    *     1300    19.2    6781    *
   253.perlbmk                               X                             X
   254.gap                                   X                             X
   255.vortex                                X                             X
   256.bzip2         1500    42.4    3537    *     1500    44.1    3401    *
   300.twolf         3000    56.4    5317    *     3000    56.3    5328    *
   Est. SPECint_base2000             4632
   Est. SPECint2000                                                4548

   168.wupwise       1600    28.2    5667    *     1600    28.7    5580    *
   171.swim          3100    26.3    11807    *     3100    27.4    11304    *
   172.mgrid         1800    26.0    6930    *     1800    31.0    5810    *
   173.applu         2100    25.5    8239    *     2100    25.6    8193    *
   177.mesa          1400    23.4    5970    *     1400    22.9    6116    *
   178.galgel                                X                             X
   179.art           2600    10.9    23807    *     2600    10.4    25014    *
   183.equake        1300    12.9    10039    *     1300    12.9    10060    *
   187.facerec       1900    17.3    11009    *     1900    20.8    9135    *
   188.ammp          2200    34.2    6441    *     2200    34.2    6428    *
   189.lucas         2000    20.7    9683    *     2000    20.7    9679    *
   191.fma3d         2100    29.7    7060    *     2100    31.5    6660    *
   200.sixtrack      1100    38.6    2847    *     1100    40.9    2687    *
   301.apsi          2600    33.1    7866    *     2600    32.7    7952    *
   Est. SPECfp_base2000              8045
   Est. SPECfp2000                                                 7766

So mesa, art and mcf seem to benefit from Haswell tuning.
Mesa is a vectorization problem (we vectorize a cold loop and introduce too much
register pressure).

What is however interesting is that Zen tuning with 256bit vectorization seems
to be worse than Haswell tuning.  I will run Haswell with 128bit vector size.

What your matrix multiplication benchmark runs into is an issue with the
multiply-and-add instruction.  Once a machine is free I will try it, but
disabling fmadd may solve the regression.

Honza
Comment 23 Andrew Roberts 2017-11-28 15:15:51 UTC
Thanks Honza,

getting closer, with original matrix.c on Ryzen:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -O3 matrix.c -o matrix
        mult took     364850 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -O3 matrix.c -o matrix
       mult took     194517 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
        mult took     130343 clocks

/usr/local/gcc/bin/gcc -march=haswell -mtune=haswell -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
        mult took     130129 clocks

These last two are comparable with the fastest obtained from trying all combinations of -march and -mtune
Comment 24 Andrew Roberts 2017-11-28 15:22:26 UTC
For the mt19937ar test:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -O3 mt19937ar.c -o mt19937ar
  mt19937ar took 462062 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -O3 mt19937ar.c -o mt19937ar
  mt19937ar took 412449 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 mt19937ar.c -o mt19937ar
  mt19937ar took 419284 clocks

/usr/local/gcc/bin/gcc -march=haswell -mtune=haswell -mprefer-vector-width=none -mno-fma -O3 mt19937ar.c -o mt19937ar
  mt19937ar took 436768 clocks

/usr/local/gcc/bin/gcc -march=corei7-avx -mtune=skylake -O3 mt19937ar.c -o mt19937ar
  mt19937ar took 410302 clocks
Comment 25 Jan Hubicka 2017-11-28 18:06:36 UTC
Hi,
I agree that the matrix multiplication FMA issue is important and hopefully it
will be fixed for GCC 8.  See
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

The irregularity of tune/arch probably originates from enabling/disabling FMA
and the AVX256 preference.  I get:
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=native -mno-fma mult.c
jh@d136:~> ./a.out
        mult took     193593 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=native -mno-fma -mprefer-vector-width=256 mult.c
jh@d136:~> ./a.out
        mult took     104745 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=haswell -mprefer-vector-width=256 mult.c
jh@d136:~> ./a.out
        mult took     160123 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=haswell -mprefer-vector-width=256 -mno-fma mult.c
jh@d136:~> ./a.out
        mult took     102048 clocks

90% difference on a common loop is quite noticeable.

Continuing my benchmarking on spec2k.
This is -Ofast -march=native -mprefer-vector-width=none compared to
-Ofast -march=native -mtune=haswell -mprefer-vector-width=128.
So neither of those is a win compared to -mtune=native.

   164.gzip          1400    58.2    2407    *     1400    57.9    2419    *
   175.vpr           1400    37.5    3731    *     1400    37.8    3704    *
   176.gcc           1100    20.0    5494    *     1100    20.0    5497    *
   181.mcf           1800    21.6    8324    *     1800    20.8    8660    *
   186.crafty        1000    20.9    4790    *     1000    21.2    4722    *
   197.parser        1800    51.4    3499    *     1800    51.8    3472    *
   252.eon           1300    19.3    6749    *     1300    18.2    7143    *
   253.perlbmk                               X                             X
   254.gap                                   X                             X
   255.vortex                                X                             X
   256.bzip2         1500    43.1    3483    *     1500    43.5    3444    *
   300.twolf         3000    56.6    5302    *     3000    57.0    5267    *
   Est. SPECint_base2000             4563    
   Est. SPECint2000                                                4591

   168.wupwise       1600    30.9    5179    *     1600    29.7    5387    *
   171.swim          3100    27.4    11309    *     3100    26.4    11739    *
   172.mgrid         1800    31.0    5814    *     1800    26.1    6895    *
   173.applu         2100    25.7    8175    *     2100    25.9    8096    *
   177.mesa          1400    23.3    6006    *     1400    23.3    6001    *
   178.galgel                                X                             X
   179.art           2600    11.0    23702    *     2600    11.0    23718    *
   183.equake        1300    13.0    10033    *     1300    13.1    9944    *
   187.facerec       1900    24.0    7931    *     1900    17.2    11040    *
   188.ammp          2200    34.4    6394    *     2200    35.2    6249    *
   189.lucas         2000    20.3    9864    *     2000    20.8    9603    *
   191.fma3d         2100    31.4    6686    *     2100    30.0    7011    *
   200.sixtrack      1100    41.7    2641    *     1100    38.5    2856    *
   301.apsi          2600    34.1    7630    *     2600    34.2    7612    *
   Est. SPECfp_base2000              7570
   Est. SPECfp2000                                                 7947
Comment 26 Jan Hubicka 2017-11-28 18:14:15 UTC
On your matrix benchmark I get:

  Vector inside of loop cost: 44
  Vector prologue cost: 12
  Vector epilogue cost: 0
  Scalar iteration cost: 40
  Scalar outside cost: 0
  Vector outside cost: 12
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:15:7: note:   Runtime profitability threshold = 4
mult.c:15:7: note:   Static estimate profitability threshold = 4

  Vector inside of loop cost: 2428
  Vector prologue cost: 4
  Vector epilogue cost: 0
  Scalar iteration cost: 2428
  Scalar outside cost: 0
  Vector outside cost: 4
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:30:7: note:   Runtime profitability threshold = 4
mult.c:30:7: note:   Static estimate profitability threshold = 4


The above is for 128bit vectorization; for 256bit:

  Vector inside of loop cost: 88
  Vector prologue cost: 24
  Vector epilogue cost: 0
  Scalar iteration cost: 40
  Scalar outside cost: 0
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:15:7: note:   Runtime profitability threshold = 8
mult.c:15:7: note:   Static estimate profitability threshold = 8

  Vector inside of loop cost: 6472
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 2428
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:30:7: note:   Runtime profitability threshold = 8
mult.c:30:7: note:   Static estimate profitability threshold = 8

So if the vectorizer knew to prefer bigger vector sizes when the cost is about double, it would vectorize
the first loop with 256 bit vectors as expected.
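
These numbers come from the vectorizer's cost model dump; a minimal sketch of how to reproduce them (assuming the test case is saved as mult.c, as in the dump lines above; the numeric dump-file suffix varies, hence the glob):

gcc -O3 -march=znver1 -mprefer-vector-width=128 -fdump-tree-vect-details -c mult.c
grep -B1 -A9 "Vector inside of loop cost" mult.c.*.vect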
Comment 27 Jan Hubicka 2017-11-28 18:28:07 UTC
Hi,
one of the problems here is the use of the vgather instruction.  It is hardly a win on the Zen architecture.
It is also on my TODO list to adjust the cost model to disable it for most loops.  I only want
to benchmark whether it is a win at all in some cases, or not at all, to set proper weights.
You can disable it with -mno-avx2.
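
A minimal, hypothetical example (not taken from the attached benchmarks) of the kind of loop where the vectorizer may use a gather once AVX2 is enabled:

/* The data-dependent load a[idx[i]] is a gather candidate: with -O3 -mavx2
   the vectorizer may emit vpgatherdd for it, while -mno-avx2 (or, later, a
   tuning without X86_TUNE_USE_GATHER) falls back to scalar loads. */
int gather_sum(const int *a, const int *idx, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}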

Still the code is a bit worse than for -march=amdfam10 -mtune=k8, which is a bit funny.
I will take a look at that.

Honza
Comment 28 Andrew Roberts 2017-11-29 04:02:10 UTC
Adding -mno-avx2 into the mix was a marginal win, but only just above the noise:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -mno-avx2 -O3 matrix.c -o matrix
       mult took     121397 clocks
       mult took     124373 clocks
       mult took     125345 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
        mult took     123262 clocks
        mult took     128193 clocks
        mult took     125891 clocks

Using -Ofast instead of -O3

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -Ofast matrix.c -o matrix
        mult took     125163 clocks
        mult took     123799 clocks
        mult took     122808 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -mno-avx2 -Ofast matrix.c -o matrix
        mult took     130189 clocks
        mult took     122726 clocks
        mult took     123686 clocks
Comment 29 Andrew Roberts 2017-11-29 04:47:03 UTC
And rerunning all the tests for matrix.c on Ryzen using:
-march=$amarch -mtune=$amtune -mprefer-vector-width=none -mno-fma -O3

The winners were:
mult took 118145 clocks -march=broadwell -mtune=broadwell
mult took 118912 clocks -march=core-avx2 -mtune=core-avx2

Top -mtune=znver1
mult took 121845 clocks -march=core-avx2 -mtune=znver1
mult took 129241 clocks -march=znver1 -mtune=znver1

And the bottom of the list no longer has a cluster of -mtune= btverX, bdverX, znver1

Worst cases:
mult took 253400 clocks -march=x86-64 -mtune=haswell
mult took 254006 clocks -march=bonnell -mtune=westmere
mult took 254624 clocks -march=bonnell -mtune=silvermont
mult took 258577 clocks -march=bonnell -mtune=nehalem
mult took 260612 clocks -march=bonnell -mtune=corei7
mult took 277789 clocks -march=nocona -mtune=nano-x4

---------

And rerunning all the tests for matrix.c on Ryzen using:
-march=$amarch -mtune=$amtune -mprefer-vector-width=none -mno-fma -mno-avx2 -Ofast

The winners were:
mult took 116405 clocks -march=broadwell -mtune=broadwell
mult took 117314 clocks -march=ivybridge -mtune=haswell
mult took 117551 clocks -march=broadwell -mtune=bdver2

Top znver1:
mult took 119951 clocks -march=knl -mtune=znver1
mult took 120442 clocks -march=znver1 -mtune=znver1

Worst cases:
mult took 239640 clocks -march=nehalem -mtune=bdver3
mult took 240623 clocks -march=athlon64-sse3 -mtune=silvermont
mult took 241143 clocks -march=eden-x2 -mtune=nano-2000
mult took 241547 clocks -march=core2 -mtune=intel
mult took 241870 clocks -march=nehalem -mtune=bdver2
mult took 248251 clocks -march=nocona -mtune=intel

The difference between broadwell and znver1 is within the margin of error, I would suggest, with these options.
Comment 30 Jan Hubicka 2017-11-29 08:26:30 UTC
Sorry, with -mno-avx2 I was speaking of the other mt benchmark.  There is no need for gathers
in matrix multiplication...

Honza
Comment 31 Andrew Roberts 2017-11-29 09:16:54 UTC
OK, for mt19937ar with -mno-avx2:

/usr/local/gcc/bin/gcc -march=$amarch -mtune=$amtune -mno-avx2 -O3 -o mt19937ar mt19937ar.c

Top 2:
mt19937ar took 358493 clocks -march=silvermont -mtune=bdver1
mt19937ar took 359933 clocks -march=corei7 -mtune=btver2

Top znver1:
mt19937ar took 363177 clocks -march=znver1 -mtune=k8-sse3
mt19937ar took 373751 clocks -march=slm -mtune=znver1
mt19937ar took 379094 clocks -march=znver1 -mtune=znver1

Worst cases:
mt19937ar took 683339 clocks -march=bdver3 -mtune=btver1
mt19937ar took 687566 clocks -march=btver2 -mtune=haswell
mt19937ar took 695629 clocks -march=athlon64-sse3 -mtune=sandybridge
mt19937ar took 697349 clocks -march=k8-sse3 -mtune=knl
mt19937ar took 697831 clocks -march=knl -mtune=core2
mt19937ar took 798283 clocks -march=opteron -mtune=athlon64-sse3

Running just for: -march=znver1 -mtune=znver1  -Ofast
mt19937ar took 445136 clocks
mt19937ar took 449784 clocks
mt19937ar took 460105 clocks

Running just for: -march=znver1 -mtune=znver1 -mno-avx2 -Ofast
mt19937ar took 416937 clocks
mt19937ar took 389458 clocks
mt19937ar took 389154 clocks

So -mno-avx2 gives 13-14% gain depending on how you look at it.
Comment 32 Andrew Roberts 2017-11-29 17:01:01 UTC
For what it's worth, here's what the latest and greatest from the competition has to offer:

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -O3 matrix.c -o matrix
        mult took     887141 clocks

/usr/local/llvm-5.0.1-rc2/biznver1 -O3 mt19937ar.c -o mt19937ar
mt19937ar took 402282 clocks

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast matrix.c -o matrix
        mult took     760913 clocks

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast mt19937ar.c -o mt19937ar
mt19937ar took 392527 clocks


current gcc-8 snapshot:
/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1  -Ofast matrix.c -o matrix
        mult took     364775 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1  -Ofast -o mt19937ar mt19937ar.c
mt19937ar took 430804 clocks

current gcc-8 snapshot + extra opts to improve znver1 performance
/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -Ofast matrix.c -o matrix
        mult took     130329 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mno-avx2 -Ofast -o mt19937ar mt19937ar.c
mt19937ar took 387728 clocks

So gcc loses on mt19937ar.c without -mno-avx2.
But gcc wins big on matrix.c, especially with -mprefer-vector-width=none -mno-fma.
Comment 33 Andrew Roberts 2017-11-29 17:04:28 UTC
That second llvm command line should read:

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast mt19937ar.c -o mt19937ar
Comment 34 Jan Hubicka 2017-11-29 19:55:04 UTC
> So gcc loses on mt19937ar.c without -mno-avx2
> But gcc wins big on matrix.c, especially with -mprefer-vector-width=none
> -mno-fma

It is because llvm does not use vgather at all unless avx512 is present.  I will
look into the vgather cost model tomorrow.

Honza
Comment 35 Jan Hubicka 2017-11-30 09:37:07 UTC
Author: hubicka
Date: Thu Nov 30 09:36:36 2017
New Revision: 255268

URL: https://gcc.gnu.org/viewcvs?rev=255268&root=gcc&view=rev
Log:
	PR target/81616
	* x86-tune-costs.h (generic_cost): Revise for modern CPUs.
	* gcc.target/i386/l_fma_double_1.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_2.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_3.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_4.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_5.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_6.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_1.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_2.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_3.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_4.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_5.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_6.c: Update count of fma instructions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_6.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_6.c
Comment 36 Jan Hubicka 2017-12-02 09:23:13 UTC
Author: hubicka
Date: Sat Dec  2 09:22:41 2017
New Revision: 255357

URL: https://gcc.gnu.org/viewcvs?rev=255357&root=gcc&view=rev
Log:

	PR target/81616
	* x86-tune.def: Remove obsolete FIXMEs.
	(X86_TUNE_PARTIAL_FLAG_REG_STALL): Disable for generic
	(X86_TUNE_FUSE_CMP_AND_BRANCH_32, X86_TUNE_FUSE_CMP_AND_BRANCH_64,
	X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, X86_TUNE_FUSE_ALU_AND_BRANCH):
	Enable for generic.
	(X86_TUNE_PAD_RETURNS): Disable for generic.
	* gcc.target/i386/pad-1.c: Compile for amdfam10.
	* gcc.target/i386/align-limit.c: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/align-limit.c
    trunk/gcc/testsuite/gcc.target/i386/pad-1.c
Comment 37 Jan Hubicka 2017-12-04 23:59:43 UTC
Author: hubicka
Date: Mon Dec  4 23:59:11 2017
New Revision: 255395

URL: https://gcc.gnu.org/viewcvs?rev=255395&root=gcc&view=rev
Log:
	PR target/81616
	* athlon.md: Disable for generic.
	* haswell.md: Enable for generic.
	* i386.c (ix86_sched_init_global): Add core hooks for generic.
	* x86-tune-sched.c (ix86_issue_rate): Increase issue rate for generic
	to 4.
	(ix86_adjust_cost): Move generic to haswell path.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/athlon.md
    trunk/gcc/config/i386/haswell.md
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune-sched.c
Comment 38 Martin Jambor 2017-12-13 14:28:43 UTC
Created attachment 42872 [details]
Untested fix for harmful FMAs

(In reply to Jan Hubicka from comment #25)
> Hi, I agree that the matrix multiplication FMA issue is
> important and hopefully it will be fixed for GCC 8.  See
> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

I am testing the attached patch to address the FMA generation.  I plan
to submit it to the mailing list this week if everything goes fine but
I would be very grateful for any comments or additional
testing/benchmarking.

The patch brings the run-time of the matrix.c testcase with native
znver1 tuning down to the levels seen with generic tuning; without it
I see 60% regressions at both -O2 and -O3.  (Even with the patch,
using -mprefer-vector-width=256 can still do quite a bit better, but at
least the difference is now 20% and not 100%.)
Comment 39 Sebastian Peryt 2017-12-14 15:24:54 UTC
I have tested it on SKX with SPEC2006INT and SPEC2017INT and don't see any regressions.
Comment 40 Martin Jambor 2017-12-15 14:32:57 UTC
(In reply to Sebastian Peryt from comment #39)
> I have tested it on SKX with SPEC2006INT and SPEC2017INT and don't see any
> regressions.

I should have written that the patch only affects znver1 tuning by
default, so if you try to see what the effects are on another
platform or with some other tuning, you need to add

--param avoid-fma-max-bits=128

or perhaps 256 if that is the preferred vector length with your tuning
(or even 512 on the most modern Intel CPUs?) to the command line.  It
would be interesting to see what the effects of that are on modern
Intel CPUs, both on SPEC and the matrix.c example.
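
A minimal usage sketch (the -march/-mtune values are only placeholders):

gcc -O3 -march=skylake -mtune=skylake --param avoid-fma-max-bits=256 matrix.c -o matrix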

Meanwhile, I have submitted the patch to mailing list:

https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01053.html
Comment 41 Jan Hubicka 2018-01-02 09:32:18 UTC
Author: hubicka
Date: Tue Jan  2 09:31:47 2018
New Revision: 256070

URL: https://gcc.gnu.org/viewcvs?rev=256070&root=gcc&view=rev
Log:

	PR target/81616
	* x86-tune-costs.h (generic_cost): Reduce cost of FDIV 20->17,
	cost of SQRT 20->14, DIVSS 18->13, DIVSD 32->17, SQRTSS 30->14
	and SQRTSD 58->18, cond_not_taken_branch_cost 2->1.  Increase
	cond_taken_branch_cost 3->4.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
Comment 42 Jan Hubicka 2018-01-02 13:04:51 UTC
Author: hubicka
Date: Tue Jan  2 13:04:19 2018
New Revision: 256073

URL: https://gcc.gnu.org/viewcvs?rev=256073&root=gcc&view=rev
Log:
	PR target/81616
	* config/i386/x86-tune-costs.h: Increase cost of integer load costs
	for generic 4->6.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
Comment 43 Jan Hubicka 2018-01-10 11:03:26 UTC
Author: hubicka
Date: Wed Jan 10 11:02:55 2018
New Revision: 256424

URL: https://gcc.gnu.org/viewcvs?rev=256424&root=gcc&view=rev
Log:
	PR target/81616
	* i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
	* i386.h (TARGET_USE_GATHER): Define.
	* x86-tune.def (X86_TUNE_USE_GATHER): New.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/x86-tune.def
Comment 44 Martin Jambor 2018-01-12 14:06:42 UTC
Author: jamborm
Date: Fri Jan 12 14:06:10 2018
New Revision: 256581

URL: https://gcc.gnu.org/viewcvs?rev=256581&root=gcc&view=rev
Log:
Deferring FMA transformations in tight loops

2018-01-12  Martin Jambor  <mjambor@suse.cz>

	PR target/81616
	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
	* tree-ssa-math-opts.c: Include domwalk.h.
	(convert_mult_to_fma_1): New function.
	(fma_transformation_info): New type.
	(fma_deferring_state): Likewise.
	(cancel_fma_deferring): New function.
	(result_of_phi): Likewise.
	(last_fma_candidate_feeds_initial_phi): Likewise.
	(convert_mult_to_fma): Added deferring logic, split actual
	transformation to convert_mult_to_fma_1.
	(math_opts_dom_walker): New type.
	(math_opts_dom_walker::after_dom_children): New method, body moved
	here from pass_optimize_widening_mul::execute, added deferring logic
	bits.
	(pass_optimize_widening_mul::execute): Moved most of code to
	math_opts_dom_walker::after_dom_children.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
	* config/i386/i386.c (ix86_option_override_internal): Added
	maybe_setting of PARAM_AVOID_FMA_MAX_BITS.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/params.def
    trunk/gcc/tree-ssa-math-opts.c
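
For context, a minimal, hypothetical sketch (not the attached matrix.c) of the accumulation pattern this change targets:

/* With FP contraction each iteration becomes acc = fma(a[i], b[i], acc),
   so the loop-carried dependence runs through the full FMA latency.
   Keeping the multiply and add separate shortens the critical path to the
   add alone, which is what the deferring logic above does for FMA candidates
   that feed such accumulator PHIs (X86_TUNE_AVOID_128FMA_CHAINS,
   --param avoid-fma-max-bits). */
double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}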
Comment 45 Jan Hubicka 2018-01-22 13:56:48 UTC
I believe all issues tracked here have been addressed.  Andrew, do you still see some anomalies?

Honza
Comment 46 Andrew Roberts 2018-01-22 18:26:42 UTC
With the latest snapshot:
gcc version 8.0.1 20180121

For mt19937ar, things now look reasonable without any strange options on Ryzen.

Top 5
mt19937ar took 226849 clocks -march=amdfam10 -mtune=btver2
mt19937ar took 228970 clocks -march=amdfam10 -mtune=barcelona
mt19937ar took 229494 clocks -march=bdver1 -mtune=btver1
mt19937ar took 229524 clocks -march=nano -mtune=nano
mt19937ar took 230003 clocks -march=opteron-sse3 -mtune=athlon64-sse3

mt19937ar took 233793 clocks -march=k8-sse3 -mtune=x86-64
mt19937ar took 241700 clocks -march=corei7 -mtune=generic
mt19937ar took 242373 clocks -march=nano-3000 -mtune=znver1
mt19937ar took 245550 clocks -march=k8-sse3 -mtune=haswell
mt19937ar took 251431 clocks -march=znver1 -mtune=generic
mt19937ar took 262200 clocks -march=znver1 -mtune=znver1
mt19937ar took 276993 clocks -march=haswell -mtune=haswell

Bot 5
mt19937ar took 341326 clocks -march=nano-x4 -mtune=silvermont
mt19937ar took 341750 clocks -march=core-avx-i -mtune=nocona
mt19937ar took 342457 clocks -march=k8 -mtune=znver1
mt19937ar took 347453 clocks -march=ivybridge -mtune=bonnell
mt19937ar took 364041 clocks -march=haswell -mtune=core-avx-i

with -mno-avx2
mt19937ar took 235997 clocks -march=znver1 -mtune=opteron
mt19937ar took 233921 clocks -march=nano-1000 -mtune=x86-64
mt19937ar took 243452 clocks -march=znver1 -mtune=x86-64
mt19937ar took 243540 clocks -march=silvermont -mtune=generic
mt19937ar took 247113 clocks -march=znver1 -mtune=generic
mt19937ar took 241368 clocks -march=nano-2000 -mtune=haswell
mt19937ar took 247806 clocks -march=znver1 -mtune=znver1

Compare this with it taking 430875 clocks originally for -march=znver1 -mtune=znver1

On Haswell 

Top 5

mt19937ar took 220000 clocks -march=amdfam10 -mtune=amdfam10
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon64
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon64-sse3
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon-fx
mt19937ar took 220000 clocks -march=amdfam10 -mtune=barcelona

mt19937ar took 220000 clocks -march=corei7-avx -mtune=x86-64
mt19937ar took 230000 clocks -march=haswell -mtune=haswell
mt19937ar took 240000 clocks -march=haswell -mtune=generic
mt19937ar took 260000 clocks -march=haswell -mtune=x86-64

Bot 5 (all various shades of mtune=bdverZ or mtune=btverZ)
mt19937ar took 310000 clocks -march=core-avx2 -mtune=bdver1
mt19937ar took 310000 clocks -march=haswell -mtune=bdver1
mt19937ar took 310000 clocks -march=skylake -mtune=bdver1
Comment 47 Andrew Roberts 2018-01-22 18:47:21 UTC
Again with the latest snapshot:
gcc version 8.0.1 20180121

matrix.c still needs additional options to get the best out of the Ryzen processor, but it is better than before (223029 clocks vs 371978 originally); 122677 is achievable with the right options. However, the same can also be said for haswell as things stand. The haswell (-march=haswell -mtune=haswell) time has dropped from 190000 to 23000, but do we put that down to Meltdown/Spectre updates or compiler updates?

With just -O3 on Ryzen:

Top 5
mult took 115669 clocks -march=ivybridge -mtune=skylake-avx512
mult took 118403 clocks -march=corei7-avx -mtune=skylake-avx512
mult took 119379 clocks -march=core-avx-i -mtune=skylake-avx512
mult took 119735 clocks -march=corei7-avx -mtune=skylake
mult took 119901 clocks -march=sandybridge -mtune=broadwell

mult took 120023 clocks -march=sandybridge -mtune=haswell
mult took 121010 clocks -march=corei7-avx -mtune=haswell
mult took 127371 clocks -march=sandybridge -mtune=x86-64
mult took 151208 clocks -march=btver2 -mtune=generic
mult took 152360 clocks -march=ivybridge -mtune=generic
mult took 173926 clocks -march=haswell -mtune=haswell
mult took 177359 clocks -march=znver1 -mtune=athlon64
mult took 180000 clocks -march=ivybridge -mtune=znver1
mult took 188219 clocks -march=znver1 -mtune=generic
mult took 199721 clocks -march=znver1 -mtune=x86-64
mult took 223029 clocks -march=znver1 -mtune=znver1

Bot 5
mult took 377398 clocks -march=znver1 -mtune=bdver3
mult took 377650 clocks -march=knl -mtune=bdver3
mult took 378600 clocks -march=core-avx2 -mtune=bonnell
mult took 381447 clocks -march=skylake-avx512 -mtune=haswell
mult took 388837 clocks -march=skylake-avx512 -mtune=bdver4

On Haswell 

Top 5
mult took 133704 clocks -march=ivybridge -mtune=k8-sse3
mult took 150000 clocks -march=btver2 -mtune=k8
mult took 150000 clocks -march=core-avx-i -mtune=x86-64
mult took 150000 clocks -march=corei7-avx -mtune=nano
mult took 150000 clocks -march=corei7-avx -mtune=opteron

mult took 160000 clocks -march=core-avx-i -mtune=haswell
mult took 190000 clocks -march=haswell -mtune=eden-x4
mult took 190000 clocks -march=ivybridge -mtune=generic
mult took 200000 clocks -march=haswell -mtune=x86-64
mult took 230000 clocks -march=haswell -mtune=haswell
mult took 270000 clocks -march=haswell -mtune=generic

Bot 5
mult took 420000 clocks -march=skylake-avx512 -mtune=bdver2
mult took 420000 clocks -march=znver1 -mtune=bdver3
mult took 420000 clocks -march=znver1 -mtune=bdver4
mult took 430000 clocks -march=bdver2 -mtune=bdver2
mult took 430000 clocks -march=knl -mtune=bdver2

Using 
-mprefer-vector-width=none -mno-fma -mno-avx2 -O3

On Ryzen
Top 5
mult took 116558 clocks -march=haswell -mtune=bdver3
mult took 116673 clocks -march=haswell -mtune=skylake
mult took 117268 clocks -march=sandybridge -mtune=skylake-avx512
mult took 117288 clocks -march=broadwell -mtune=nocona
mult took 118450 clocks -march=corei7-avx -mtune=haswell

mult took 119719 clocks -march=core-avx-i -mtune=znver1
mult took 120028 clocks -march=znver1 -mtune=skylake
mult took 122677 clocks -march=znver1 -mtune=znver1
mult took 123423 clocks -march=haswell -mtune=haswell
mult took 127388 clocks -march=skylake -mtune=x86-64
mult took 130475 clocks -march=znver1 -mtune=x86-64
mult took 132374 clocks -march=sandybridge -mtune=generic
mult took 162317 clocks -march=znver1 -mtune=generic

Bot 5
mult took 300000 clocks -march=nano-x2 -mtune=btver2
mult took 310000 clocks -march=skylake-avx512 -mtune=westmere
mult took 319772 clocks -march=knl -mtune=sandybridge
mult took 320000 clocks -march=eden-x2 -mtune=amdfam10
mult took 330000 clocks -march=atom -mtune=broadwell

On Haswell

Top 5
mult took 123148 clocks -march=bonnell -mtune=ivybridge
mult took 130262 clocks -march=ivybridge -mtune=silvermont
mult took 135299 clocks -march=core-avx2 -mtune=nano-3000
mult took 150000 clocks -march=core-avx2 -mtune=intel
mult took 150000 clocks -march=haswell -mtune=btver1

mult took 170000 clocks -march=core-avx-i -mtune=haswell
mult took 170000 clocks -march=znver1 -mtune=x86-64
mult took 180000 clocks -march=haswell -mtune=haswell
mult took 180000 clocks -march=znver1 -mtune=generic
mult took 210000 clocks -march=haswell -mtune=generic
mult took 230000 clocks -march=haswell -mtune=x86-64

Bot 5
mult took 350000 clocks -march=nano-x4 -mtune=nano-2000
mult took 350000 clocks -march=slm -mtune=skylake-avx512
mult took 360000 clocks -march=barcelona -mtune=broadwell
mult took 360000 clocks -march=nano -mtune=corei7
mult took 360000 clocks -march=nocona -mtune=btver2
Comment 48 Andrew Roberts 2018-01-22 18:48:58 UTC
Correction, that should be 230000 not 23000 for the haswell drop in performance.
Comment 49 Jan Hubicka 2018-01-22 19:58:56 UTC
> matrix.c is still needing additional options to get the best out of the Ryzen
> processor. But is better than before (223029 clocks vs 371978 originally), 
> but 122677 is achievable with the right options. However the same can also be

Aha, for Ryzen we would still benefit from 256bit vectorization.  It is not a win
overall, and it will need bigger surgery to the vectorizer to implement properly, so
that will unfortunately have to wait for the next stage1.

This is the gap between -march=znver1 -mtune=generic and -march=znver1, so about
17%.

Concerning your options -mprefer-vector-width=none -mno-fma -mno-avx2 -O3:
with Martin's patch in, -mno-fma should no longer have an effect here.  Not sure
why -mno-avx2 would be a win either; we originally introduced it to disable
scatter/gather in the other benchmark, but that one is solved too.
Do those two options still improve the scores for you?

It is also a mystery to me why -march=ivybridge would benefit anything, as the
znver ISA is more or less a superset of it.  I will try to check more.

Honza
Comment 50 Andrew Roberts 2018-01-23 03:55:24 UTC
with the matrix.c benchmark on Ryzen and looking at the other options when using -march=znver1 and -mtune=znver1

mult took 225281 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128
mult took 185961 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256
mult took 187577 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512

Adding -mno-avx2 has no effect on the above baseline.

adding in -mno-fma

mult took 223302 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128 -mno-fma
mult took 123773 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256 -mno-fma
mult took 124690 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512 -mno-fma

Is the patch in trunk yet? I was assuming it was from the other comments.

using -march=ivybridge but keeping the rest of the options:
mult took 215052 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=128   -mno-fma
mult took 121661 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=256 -mno-fma
mult took 131763 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=512 -mno-fma

Switching to -march=ivybridge -mtune=skylake-avx512 and dropping the other options (and still on Ryzen)
mult took 119195 clocks -march=ivybridge -mtune=skylake-avx512 

With -march=znver1 -mtune=skylake-avx512 and dropping the other options
mult took 182799 clocks -march=znver1 -mtune=skylake-avx512

So the combination of -march=ivybridge -mtune=skylake-avx512 is doing something right.
Comment 51 Martin Jambor 2018-01-23 17:26:34 UTC
(In reply to Andrew Roberts from comment #50)
> with the matrix.c benchmark on Ryzen and looking at the other options when
> using -march=znver1 and -mtune=znver1
> 
> mult took 225281 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128
> mult took 185961 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256
> mult took 187577 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512
> 
> -adding mno-avx2 has no effect on the above baseline.
> 
> adding in -mno-fma
> 
> mult took 223302 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=128 -mno-fma
> mult took 123773 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=256 -mno-fma
> mult took 124690 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=512 -mno-fma
> 
> Is the patch in trunk yet? I was assuming it was from the other comments.

Yes, but by default (on Zen) it only prevents generating FMAs for
128bit operands (or smaller).  Originally, AMD kept 256bit ones or
larger intact in their splitting patch (and in a conversation they
hinted that they might be beneficial in some scenarios) and I kept the
condition there because 256bit vectors are not well understood and I
had little time.

We will definitely look at this when examining AVX256 on Zen.  I am not
sure whether we want to lift the restriction only based on matrix.c in
stage 4.  But I would not oppose it.
Comment 52 Richard Biener 2019-04-11 12:26:21 UTC
Fixed?  Or shall we take it as recurring bug?
Comment 53 Martin Jambor 2019-04-17 09:40:52 UTC
I'd vote for marking this fixed (and asking anyone with other ideas what could be improved in generic tuning to open a new bug).
Comment 54 Jan Hubicka 2019-04-18 13:24:40 UTC
Yep, I think we could declare this as fixed.
The cost tuning seems to work reasonably well for cores and zens.
Comment 55 GCC Commits 2022-12-07 08:45:03 UTC
The master branch has been updated by Hongyu Wang <hongyuw@gcc.gnu.org>:

https://gcc.gnu.org/g:3a1a141f79c83ad38f7db3a21d8a4dcfe625c176

commit r13-4534-g3a1a141f79c83ad38f7db3a21d8a4dcfe625c176
Author: Hongyu Wang <hongyu.wang@intel.com>
Date:   Tue Dec 6 09:53:35 2022 +0800

    i386: Avoid fma_chain for -march=alderlake and sapphirerapids.
    
    For Alderlake there is an issue similar to PR 81616; enabling
    avoid_fma256_chain also benefits the latest Intel platforms,
    Alderlake and Sapphire Rapids.
    
    gcc/ChangeLog:
    
            * config/i386/x86-tune.def (X86_TUNE_AVOID_256FMA_CHAINS): Add
            m_SAPPHIRERAPIDS, m_ALDERLAKE and m_CORE_ATOM.