-mtune=generic should be updated for the current Intel and AMD processors.
*** Bug 81614 has been marked as a duplicate of this bug. ***
Confirmed. Honza is working on this.
I am mostly done with my tuning overhaul for core+ and znver and I plan to work on generic now in early stage3. My rough plan is:
- drop flags that are there for the benefit of anything earlier than Core2 and Bulldozer
- base instruction costs on Haswell (and later) + Znver1 latencies, keeping Bulldozers in mind
- revisit code alignment strategies. It seems to me that by default we align way too much for both Core and Zen. Maybe code alignment does not pay back at all for -O2 and should be done at -Ofast only, or so.
- switch instruction scheduling to a more modern chip (currently we schedule for K8). Here I need to figure out how much Core-based chips care about the particular scheduler model, but I suspect both Core and Zen are quite neutral here and mostly benefit from basic scheduling for latencies.
- figure out the best vectorization model - here AVX may be fun, because Core and Znver prefer different kinds of codegen.
Ideas are welcome.
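The alignment question can be experimented with directly, since GCC already exposes the knobs; a minimal sketch (an alignment value of 1 effectively disables alignment; bench.c is a stand-in for whatever benchmark you use, not a file from this report):

gcc -O2 -falign-functions=1 -falign-loops=1 -falign-jumps=1 -falign-labels=1 bench.c -o bench
./bench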
I've been testing on a Ryzen system and also comparing with Haswell and Skylake. From my testing, -mtune=znver1 does not perform well and never has, including as of the last snapshot: gcc version 8.0.0 20171119 (experimental) (GCC).

-mtune=generic seems a better option for all three systems as a default for -march=native. This is only with one test case (attached), but I've seen the same across many other tests.

See the attached testcase (matrix.c) and performance logs:
Ryzen   - znver1-tunebug.txt
Haswell - znver1-tunebug2.txt
Skylake - znver1-tunebug3.txt
Created attachment 42687 [details] Test program used for the attached performance results (matrix.c)
Created attachment 42688 [details] Test results for Ryzen system with matrix.c
Created attachment 42689 [details] Test results for Haswell system with matrix.c
Created attachment 42690 [details] Test results for Skylake system with matrix.c
Created attachment 42691 [details] Script for matrix.c test program
I've also been wondering whether the ISA selection shouldn't affect -mtune=generic tuning; say, in TUs or even just functions that have AVX512* enabled, shouldn't the generic tuning be taken just from the set of CPUs that currently support that ISA? Of course that would change once some AMD chips start supporting it.
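A hedged sketch of the per-function case: GCC's target attribute already lets a single function opt into AVX512 codegen inside an otherwise generic-tuned TU (the attribute is existing GCC functionality; deriving generic tuning from the AVX512-capable CPU set for such functions is the proposal above, not current behavior):

/* Compiled with AVX512F enabled even if the rest of the TU is not;
   under the proposal, "generic" tuning for this function could be
   taken only from CPUs that actually implement AVX512.  */
__attribute__((target("avx512f")))
void scale (double *a, double f, int n)
{
  for (int i = 0; i < n; i++)
    a[i] *= f;
}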
Ok, I've tried again with this week's snapshot: gcc version 8.0.0 20171126 (experimental) (GCC)

Taking a combination of -march and -mtune which works well on Ryzen:

/usr/local/gcc/bin/gcc -march=core-avx-i -mtune=nocona -O3 matrix.c -o matrix
./matrix
mult took 131153 clocks

Then switching to -mtune=znver1:

/usr/local/gcc/bin/gcc -march=core-avx-i -mtune=znver1 -O3 matrix.c -o matrix
./matrix
mult took 231309 clocks

Then looking at the differences in the -Q --help=target output for these two and eliminating each difference at a time, I found that:

gcc -march=core-avx-i -mtune=znver1 -mprefer-vector-width=none -O3 matrix.c -o matrix
[aroberts@ryzen share]$ ./matrix
mult took 132295 clocks

The default for znver1 is -mprefer-vector-width=128. So is this option still helping with the latest microcode? Not in this case at least.

cat /proc/cpuinfo:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1700 Eight-Core Processor
stepping        : 1
microcode       : 0x8001129

with -march=znver1 -mtune=znver1 with the default of -mprefer-vector-width=128
mult took 386291 clocks

with -march=znver1 -mtune=znver1 -mprefer-vector-width=none
mult took 201455 clocks
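The option-bisection approach above can be scripted; a minimal sketch (file names are illustrative):

gcc -O3 -march=core-avx-i -mtune=nocona -Q --help=target > nocona.opts
gcc -O3 -march=core-avx-i -mtune=znver1 -Q --help=target > znver1.opts
diff nocona.opts znver1.opts   # each differing option is a candidate to toggle one at a time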
> So is this option still helping with the latest microcode? Not in this case at
> least.

It is on my TODO list to re-benchmark 256bit vectorization for Zen. I do not think microcode makes a big difference here. Using 256 bit vectors has the advantage of exposing more parallelism but also the disadvantage of requiring more involved setup. So for loops that vectorize naturally (like matrix multiplication) it can be a win, while for loops that are difficult to vectorize it is a loss. So I think the early benchmarks did not look consistent and that is why the 128bit mode was introduced.

It is not that different from vectorizing for K8, which had split SSE registers in a similar fashion, or for Kabylake, which splits 512 bit operations.

While rewriting the cost model I tried to keep this in mind and more accurately model the split operations, so it may be possible to switch to 256 by default.

Ideally the vectorizer should make a decision whether 128 or 256 is a win for a particular loop, but it doesn't seem to have the infrastructure to do so. My plan is to split the current flag into two - prefer 128bit, and assume that registers are internally split - and see if that is enough to get a consistent win for 256 bit vectorization.

Richi may know better.
Honza
It would be nice if znver1 for -march and -mtune could be improved before the gcc 8 release. At present -march=znver1 -mtune=znver1 looks to be about the worst thing you could do, and not just on this vectorizable code. And given we tell people to use -march=native, which gives this, it would be nice to improve.

With the attached example, switching to larger vectors still only gets to 200000 clocks, whereas other combinations get down to 116045:

mult took 116045 clocks -march=corei7-avx -mtune=skylake

So there is more going on here than just the vector length. If there is any testing to isolate other options I would be happy to help, just point me in the right direction. If there are good (open) benchmarks I can routinely test on a range of targets, I would be happy to. I have Ryzen, Haswell, Skylake, ARM, AArch64, etc.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
>
> --- Comment #14 from Andrew Roberts <andrewm.roberts at sky dot com> ---
> It would be nice if znver1 for -march and -mtune could be improved before the
> gcc 8 release. At present -march=znver1 -mtune=znver1 looks to be about the
> worst thing you could do, and not just on this vectorizable code. And given we
> tell people to use -march=native which gives this, it would be nice to improve.

We benchmarked znver1 tuning quite thoroughly with spec2000, spec2006 and 2017 and the situation is not that bad. In August, with -O2, native tuning was about 0.3% better than generic (for both int and fp; this does not include vectorization because of -O2, and keep in mind that spec is often bound by memory, so a 0.3% difference is quite noticeable). All regressions in individual benchmarks were under 2%, and some have been fixed since then. For -Ofast the difference is about 0.5% for integer, with two notable regressions which have WIP solutions. Integer/Core tuning went worse than generic, so things were as intended.

I will quickly re-test 256bit vectorization with specfp2k (that is fast). Please attach the regressing testcases you have and I will take a look, too.
Honza
(In reply to Jan Hubicka from comment #13)
> It is on my TODO list to re-benchmark 256bit vectorization for Zen. I do not
> think microcode makes a big difference here. Using 256 bit vectors has the
> advantage of exposing more parallelism but also the disadvantage of
> requiring more involved setup. So for loops that vectorize naturally (like
> matrix multiplication) it can be a win, while for loops that are difficult
> to vectorize it is a loss. So I think the early benchmarks did not look
> consistent and that is why the 128bit mode was introduced.
>
> It is not that different from vectorizing for K8, which had split SSE
> registers in a similar fashion, or for Kabylake, which splits 512 bit
> operations.
>
> While rewriting the cost model I tried to keep this in mind and more
> accurately model the split operations, so it may be possible to switch to
> 256 by default.
>
> Ideally the vectorizer should make a decision whether 128 or 256 is a win
> for a particular loop, but it doesn't seem to have the infrastructure to do
> so. My plan is to split the current flag into two - prefer 128bit, and
> assume that registers are internally split - and see if that is enough to
> get a consistent win for 256 bit vectorization.
>
> Richi may know better.

The vectorizer cannot currently evaluate vectorization costs for two (or multiple) vector lengths against each other. Doing so with the current implementation would have prohibitive cost (basically do the analysis phase twice, and if unlucky and the "first" wins, re-do the analysis phase of the winner). Hmm, maybe not _too_ bad in the end... But first and foremost, costing is not aware of split AVX256 penalties, so I'm not sure doing the above would help. I can cook up a "quick" prototype (maybe hidden behind a --param paywall) so one could benchmark such a mode. Is there interest?

> Honza
The general consensus in userland is that the znver1 tuning is off by much more than 0.5%, or even 2%. Most people are using -march=haswell if they care about performance. Just taking one part of one of my apps, I see a 5% difference with -march=haswell vs -march=znver1, and this is just general code (loading GL extensions). The trick is removing system dependencies from the things I can benchmark. If there are no recommendations, I'll come up with some tests myself for various workloads and try them across various march/tune combos. I'll also look at some other real-world benchmarks that are available online.
Ok, trying an entirely different algorithm, same results.

Using the Mersenne Twister algorithm from here:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html

alter the main program to comment out the original test harness, and replace main with:

int main(void)
{
    int i;
    unsigned long init[4]={0x123, 0x234, 0x345, 0x456}, length=4;
    init_by_array(init, length);

    clock_t e, s=clock();
    int j=genrand_int32();
    for(i=0; i<100000000; i++) {
        j ^= genrand_int32();
    }
    e=clock();
    if (j != -549769613)
        printf("Error j != -549769613 (%d)\n", j);
    printf("mt19937ar took %ld clocks ", (long)(e-s));
    return 0;
}

So nothing complicated.

On Ryzen:
--------
Top 5:
mt19937ar took 354877 clocks -march=amdfam10 -mtune=k8
mt19937ar took 356203 clocks -march=bdver2 -mtune=eden-x2
mt19937ar took 356534 clocks -march=nano-x2 -mtune=nano-1000
mt19937ar took 357321 clocks -march=athlon-fx -mtune=nano-x4
mt19937ar took 357634 clocks -march=bdver3 -mtune=nano-x2

Bot 5:
mt19937ar took 675052 clocks -march=nano -mtune=btver1
mt19937ar took 679826 clocks -march=k8 -mtune=nocona
mt19937ar took 681118 clocks -march=opteron -mtune=atom
mt19937ar took 689604 clocks -march=core2 -mtune=broadwell
mt19937ar took 699840 clocks -march=skylake -mtune=generic

Top -mtune=znver1:
mt19937ar took 369722 clocks -march=nano-x2 -mtune=znver1

Top -march=znver1:
mt19937ar took 375286 clocks -march=znver1 -mtune=silvermont

-march=znver1 -mtune=znver1 (aka native):
mt19937ar took 430875 clocks -march=znver1 -mtune=znver1

-march=haswell -mtune=haswell:
mt19937ar took 402963 clocks -march=haswell -mtune=haswell

-march=k8 -mtune=k8:
mt19937ar took 367890 clocks -march=k8 -mtune=k8

so -march=znver1 -mtune=znver1 is:
7% slower than tuning for haswell
17% slower than tuning for k8

Again -mtune=znver1, -mtune=bdverX, -mtune=btverX all cluster at the bottom.

On Haswell:
----------
Top 5:
mt19937ar took 290000 clocks -march=amdfam10 -mtune=barcelona
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver1
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver2
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver3
mt19937ar took 290000 clocks -march=amdfam10 -mtune=bdver4

Bot 5:
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver3
mt19937ar took 370000 clocks -march=znver1 -mtune=bdver4
mt19937ar took 370000 clocks -march=znver1 -mtune=btver2
mt19937ar took 370000 clocks -march=znver1 -mtune=znver1
mt19937ar took 380000 clocks -march=knl -mtune=bdver1

Top -mtune=haswell:
mt19937ar took 300000 clocks -march=bdver4 -mtune=haswell

Top -march=haswell:
mt19937ar took 300000 clocks -march=haswell -mtune=broadwell

-march=haswell -mtune=haswell (aka native):
mt19937ar took 300000 clocks -march=haswell -mtune=haswell

Best performing pair:
mt19937ar took 290000 clocks -march=barcelona -mtune=barcelona

so the haswell options are pretty much optimal on that hardware, as in the other test.
Created attachment 42735 [details] modified mt19937ar test program, test script and results

tar -tf mt19937ar-test.tar.gz
./doit.csh              <= Test script, change path to gcc!
./mt19937ar.c           <= main function altered to give test results
./mt19937ar-haswell.txt <= full results on Intel Core i5-4570S
./mt19937ar-ryzen.txt   <= full results on AMD Ryzen 7 1700 Eight-Core Processor
Again, those latest mt19937ar results above were with the current snapshot:

/usr/local/gcc/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc-8.0.0/libexec/gcc/x86_64-unknown-linux-gnu/8.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-8.0.0/configure --prefix=/usr/local/gcc-8.0.0 --program-suffix= --disable-werror --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --enable-gnu-indirect-function --with-isl --enable-languages=c,c++,fortran,lto --disable-libgcj --enable-lto --enable-multilib --with-tune=generic --with-arch_32=i686 --host=x86_64-unknown-linux-gnu --build=x86_64-unknown-linux-gnu --disable-bootstrap
Thread model: posix
gcc version 8.0.0 20171126 (experimental) (GCC)
Hi, this is comparing SPEC2000 -Ofast -march=native -mprefer-vector-width=128 to -Ofast -march=native -mprefer-vector-width=256 on Ryzen.

168.wupwise    1600   28.2   5669 *   1600   30.8   5187 *
171.swim       3100   26.4  11763 *   3100   27.5  11261 *
172.mgrid      1800   26.1   6907 *   1800   30.9   5827 *
173.applu      2100   25.5   8234 *   2100   25.7   8161 *
177.mesa       1400   23.4   5971 *   1400   23.2   6030 *
178.galgel        X                      X
179.art        2600   10.9  23752 *   2600   10.9  23777 *
183.equake     1300   12.9  10047 *   1300   12.9  10063 *
187.facerec    1900   17.2  11025 *   1900   24.0   7921 *
188.ammp       2200   34.2   6431 *   2200   34.4   6397 *
189.lucas      2000   20.3   9859 *   2000   20.4   9807 *
191.fma3d      2100   29.7   7061 *   2100   31.4   6694 *
200.sixtrack   1100   38.8   2834 *   1100   41.5   2648 *
301.apsi       2600   33.0   7873 *   2600   33.1   7856 *
Est. SPECfp_base2000          8049
Est. SPECfp2000                                      7590

164.gzip       1400   57.1   2450 *   1400   58.0   2413 *
175.vpr        1400   37.4   3746 *   1400   37.5   3733 *
176.gcc        1100   20.2   5450 *   1100   20.0   5489 *
181.mcf        1800   21.7   8310 *   1800   21.4   8402 *
186.crafty     1000   20.5   4874 *   1000   20.9   4794 *
197.parser     1800   51.7   3481 *   1800   51.5   3498 *
252.eon        1300   18.2   7154 *   1300   19.2   6759 *
253.perlbmk       X                      X
254.gap           X                      X
255.vortex        X                      X
256.bzip2      1500   42.6   3522 *   1500   42.9   3496 *
300.twolf      3000   56.5   5313 *   3000   56.3   5330 *
Est. SPECint_base2000         4612
Est. SPECint2000                                     4575

So it does not seem to be a win in general. I will compare with -mtune=haswell now.
Hi, this is the same base (so you can see there is some noise) compared to haswell tuning.

164.gzip       1400   57.1   2452 *   1400   58.7   2384 *
175.vpr        1400   37.1   3776 *   1400   38.3   3659 *
176.gcc        1100   20.0   5500 *   1100   20.1   5464 *
181.mcf        1800   21.6   8327 *   1800   20.9   8617 *
186.crafty     1000   20.4   4905 *   1000   21.0   4760 *
197.parser     1800   51.3   3506 *   1800   51.9   3466 *
252.eon        1300   18.2   7162 *   1300   19.2   6781 *
253.perlbmk       X                      X
254.gap           X                      X
255.vortex        X                      X
256.bzip2      1500   42.4   3537 *   1500   44.1   3401 *
300.twolf      3000   56.4   5317 *   3000   56.3   5328 *
Est. SPECint_base2000         4632
Est. SPECint2000                                     4548

168.wupwise    1600   28.2   5667 *   1600   28.7   5580 *
171.swim       3100   26.3  11807 *   3100   27.4  11304 *
172.mgrid      1800   26.0   6930 *   1800   31.0   5810 *
173.applu      2100   25.5   8239 *   2100   25.6   8193 *
177.mesa       1400   23.4   5970 *   1400   22.9   6116 *
178.galgel        X                      X
179.art        2600   10.9  23807 *   2600   10.4  25014 *
183.equake     1300   12.9  10039 *   1300   12.9  10060 *
187.facerec    1900   17.3  11009 *   1900   20.8   9135 *
188.ammp       2200   34.2   6441 *   2200   34.2   6428 *
189.lucas      2000   20.7   9683 *   2000   20.7   9679 *
191.fma3d      2100   29.7   7060 *   2100   31.5   6660 *
200.sixtrack   1100   38.6   2847 *   1100   40.9   2687 *
301.apsi       2600   33.1   7866 *   2600   32.7   7952 *
Est. SPECfp_base2000          8045
Est. SPECfp2000                                      7766

So mesa, art and mcf seem to benefit from Haswell tuning. Mesa is a vectorization problem (we vectorize a cold loop and introduce too much register pressure).

What is however interesting is that Zen tuning with 256bit vectorization seems to be worse than Haswell tuning. I will run haswell with 128bit vector size.

What your matrix multiplication benchmark runs into is an issue with the multiply-and-add instruction. Once the machine is free I will try it, but disabling fmadd may solve the regression.
Honza
Thanks Honza, getting closer, with the original matrix.c on Ryzen:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -O3 matrix.c -o matrix
mult took 364850 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -O3 matrix.c -o matrix
mult took 194517 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
mult took 130343 clocks

/usr/local/gcc/bin/gcc -march=haswell -mtune=haswell -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
mult took 130129 clocks

These last two are comparable with the fastest obtained from trying all combinations of -march and -mtune.
For the mt19937ar test:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -O3 mt19937ar.c -o mt19937ar
mt19937ar took 462062 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -O3 mt19937ar.c -o mt19937ar
mt19937ar took 412449 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 mt19937ar.c -o mt19937ar
mt19937ar took 419284 clocks

/usr/local/gcc/bin/gcc -march=haswell -mtune=haswell -mprefer-vector-width=none -mno-fma -O3 mt19937ar.c -o mt19937ar
mt19937ar took 436768 clocks

/usr/local/gcc/bin/gcc -march=corei7-avx -mtune=skylake -O3 mt19937ar.c -o mt19937ar
mt19937ar took 410302 clocks
Hi, I agree that the matrix multiplication fma issue is important and hopefully it will be fixed for GCC 8. See https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

The irregularity of tune/arch is probably originating from enabling/disabling fma and the avx256 preference. I get:

jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=native -mno-fma mult.c
jh@d136:~> ./a.out
mult took 193593 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=native -mno-fma -mprefer-vector-width=256 mult.c
jh@d136:~> ./a.out
mult took 104745 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=haswell -mprefer-vector-width=256 mult.c
jh@d136:~> ./a.out
mult took 160123 clocks
jh@d136:~> /home/jh/trunk-install-new3/bin/gcc -Ofast -march=haswell -mprefer-vector-width=256 -mno-fma mult.c
jh@d136:~> ./a.out
mult took 102048 clocks

A 90% difference on a common loop is quite noticeable. Continuing my benchmarking on spec2k. This is -Ofast -march=native -mprefer-vector-width=none compared to -Ofast -march=native -mtune=haswell -mprefer-vector-width=128. So neither of those is a win compared to -mtune=native.

164.gzip       1400   58.2   2407 *   1400   57.9   2419 *
175.vpr        1400   37.5   3731 *   1400   37.8   3704 *
176.gcc        1100   20.0   5494 *   1100   20.0   5497 *
181.mcf        1800   21.6   8324 *   1800   20.8   8660 *
186.crafty     1000   20.9   4790 *   1000   21.2   4722 *
197.parser     1800   51.4   3499 *   1800   51.8   3472 *
252.eon        1300   19.3   6749 *   1300   18.2   7143 *
253.perlbmk       X                      X
254.gap           X                      X
255.vortex        X                      X
256.bzip2      1500   43.1   3483 *   1500   43.5   3444 *
300.twolf      3000   56.6   5302 *   3000   57.0   5267 *
Est. SPECint_base2000         4563
Est. SPECint2000                                     4591

168.wupwise    1600   30.9   5179 *   1600   29.7   5387 *
171.swim       3100   27.4  11309 *   3100   26.4  11739 *
172.mgrid      1800   31.0   5814 *   1800   26.1   6895 *
173.applu      2100   25.7   8175 *   2100   25.9   8096 *
177.mesa       1400   23.3   6006 *   1400   23.3   6001 *
178.galgel        X                      X
179.art        2600   11.0  23702 *   2600   11.0  23718 *
183.equake     1300   13.0  10033 *   1300   13.1   9944 *
187.facerec    1900   24.0   7931 *   1900   17.2  11040 *
188.ammp       2200   34.4   6394 *   2200   35.2   6249 *
189.lucas      2000   20.3   9864 *   2000   20.8   9603 *
191.fma3d      2100   31.4   6686 *   2100   30.0   7011 *
200.sixtrack   1100   41.7   2641 *   1100   38.5   2856 *
301.apsi       2600   34.1   7630 *   2600   34.2   7612 *
Est. SPECfp_base2000          7570
Est. SPECfp2000                                      7947
On your matrix benchmark I get:

Vector inside of loop cost: 44
Vector prologue cost: 12
Vector epilogue cost: 0
Scalar iteration cost: 40
Scalar outside cost: 0
Vector outside cost: 12
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
mult.c:15:7: note:   Runtime profitability threshold = 4
mult.c:15:7: note:   Static estimate profitability threshold = 4

Vector inside of loop cost: 2428
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar iteration cost: 2428
Scalar outside cost: 0
Vector outside cost: 4
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
mult.c:30:7: note:   Runtime profitability threshold = 4
mult.c:30:7: note:   Static estimate profitability threshold = 4

for 128bit vectorization, and for 256bit:

Vector inside of loop cost: 88
Vector prologue cost: 24
Vector epilogue cost: 0
Scalar iteration cost: 40
Scalar outside cost: 0
Vector outside cost: 24
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
mult.c:15:7: note:   Runtime profitability threshold = 8
mult.c:15:7: note:   Static estimate profitability threshold = 8

Vector inside of loop cost: 6472
Vector prologue cost: 8
Vector epilogue cost: 0
Scalar iteration cost: 2428
Scalar outside cost: 0
Vector outside cost: 8
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
mult.c:30:7: note:   Runtime profitability threshold = 8
mult.c:30:7: note:   Static estimate profitability threshold = 8

So if the vectorizer knew to prefer bigger vector sizes when the cost is about double, it would vectorize the first loop with 256 as expected.
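Making the arithmetic explicit (my reading of the dumps, not part of the vectorizer output): for the first loop the 256bit inside-loop cost (88) is exactly double the 128bit one (44), so the per-element cost is equal and the wider vector, which halves the iteration count, should win. For the second loop the 256bit cost (6472) is about 2.7x the 128bit one (2428), so per element 256bit is genuinely more expensive there and 128bit remains the right choice.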
Hi, one of the problems here is the use of the vgather instruction. It is hardly a win on the Zen architecture. It is also on my TODO list to adjust the cost model to disable it for most loops. I only want to benchmark whether it is a win at all in some cases, or not at all, to set proper weights. You can disable it with -mno-avx2.

Still, the code is a bit worse than for -march=amdfam10 -mtune=k8, which is a bit funny. I will take a look at that.
Honza
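For illustration (my sketch, not a reduction of the actual benchmark), the pattern that tempts the vectorizer into vgather is an indirect load in an otherwise vectorizable loop:

/* Indexed gather: with AVX2 enabled the vectorizer can turn the
   a[idx[i]] loads into vgather instructions; -mno-avx2 forces
   scalar loads instead.  */
float sum_indexed (const float *a, const int *idx, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[idx[i]];
  return s;
}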
Adding -mno-avx2 into the mix was a marginal win, but only just showing out of the noise:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -mno-avx2 -O3 matrix.c -o matrix
mult took 121397 clocks
mult took 124373 clocks
mult took 125345 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -O3 matrix.c -o matrix
mult took 123262 clocks
mult took 128193 clocks
mult took 125891 clocks

Using -Ofast instead of -O3:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -Ofast matrix.c -o matrix
mult took 125163 clocks
mult took 123799 clocks
mult took 122808 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -mno-avx2 -Ofast matrix.c -o matrix
mult took 130189 clocks
mult took 122726 clocks
mult took 123686 clocks
And rerunning all the tests for matrix.c on Ryzen using:
-march=$amarch -mtune=$amtune -mprefer-vector-width=none -mno-fma -O3

The winners were:
mult took 118145 clocks -march=broadwell -mtune=broadwell
mult took 118912 clocks -march=core-avx2 -mtune=core-avx2

Top -mtune=znver1:
mult took 121845 clocks -march=core-avx2 -mtune=znver1
mult took 129241 clocks -march=znver1 -mtune=znver1

And the bottom of the list no longer has a cluster of -mtune= btverX, bdverX, znver1.

Worst cases:
mult took 253400 clocks -march=x86-64 -mtune=haswell
mult took 254006 clocks -march=bonnell -mtune=westmere
mult took 254624 clocks -march=bonnell -mtune=silvermont
mult took 258577 clocks -march=bonnell -mtune=nehalem
mult took 260612 clocks -march=bonnell -mtune=corei7
mult took 277789 clocks -march=nocona -mtune=nano-x4

---------

And rerunning all the tests for matrix.c on Ryzen using:
-march=$amarch -mtune=$amtune -mprefer-vector-width=none -mno-fma -mno-avx2 -Ofast

The winners were:
mult took 116405 clocks -march=broadwell -mtune=broadwell
mult took 117314 clocks -march=ivybridge -mtune=haswell
mult took 117551 clocks -march=broadwell -mtune=bdver2

Top znver1:
mult took 119951 clocks -march=knl -mtune=znver1
mult took 120442 clocks -march=znver1 -mtune=znver1

Worst cases:
mult took 239640 clocks -march=nehalem -mtune=bdver3
mult took 240623 clocks -march=athlon64-sse3 -mtune=silvermont
mult took 241143 clocks -march=eden-x2 -mtune=nano-2000
mult took 241547 clocks -march=core2 -mtune=intel
mult took 241870 clocks -march=nehalem -mtune=bdver2
mult took 248251 clocks -march=nocona -mtune=intel

The difference between broadwell and znver1 is within the margin of error with these options, I would suggest.
Sorry, with -mno-avx2 I was speaking of the other mt benchmark. There is no need for gathers in matrix multiplication... Honza
Ok, for mt19937ar with -mno-avx2:

/usr/local/gcc/bin/gcc -march=$amarch -mtune=$amtune -mno-avx2 -O3 -o mt19937ar mt19937ar.c

Top 2:
mt19937ar took 358493 clocks -march=silvermont -mtune=bdver1
mt19937ar took 359933 clocks -march=corei7 -mtune=btver2

Top znver1:
mt19937ar took 363177 clocks -march=znver1 -mtune=k8-sse3
mt19937ar took 373751 clocks -march=slm -mtune=znver1
mt19937ar took 379094 clocks -march=znver1 -mtune=znver1

Worst cases:
mt19937ar took 683339 clocks -march=bdver3 -mtune=btver1
mt19937ar took 687566 clocks -march=btver2 -mtune=haswell
mt19937ar took 695629 clocks -march=athlon64-sse3 -mtune=sandybridge
mt19937ar took 697349 clocks -march=k8-sse3 -mtune=knl
mt19937ar took 697831 clocks -march=knl -mtune=core2
mt19937ar took 798283 clocks -march=opteron -mtune=athlon64-sse3

Running just for -march=znver1 -mtune=znver1 -Ofast:
mt19937ar took 445136 clocks
mt19937ar took 449784 clocks
mt19937ar took 460105 clocks

Running just for -march=znver1 -mtune=znver1 -mno-avx2 -Ofast:
mt19937ar took 416937 clocks
mt19937ar took 389458 clocks
mt19937ar took 389154 clocks

So -mno-avx2 gives a 13-14% gain, depending on how you look at it.
For what it's worth, here's what the latest and greatest from the competition has to offer:

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -O3 matrix.c -o matrix
mult took 887141 clocks

/usr/local/llvm-5.0.1-rc2/biznver1 -O3 mt19937ar.c -o mt19937ar
mt19937ar took 402282 clocks

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast matrix.c -o matrix
mult took 760913 clocks

/usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast mt19937ar.c -o mt19937ar
mt19937ar took 392527 clocks

current gcc-8 snapshot:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -Ofast matrix.c -o matrix
mult took 364775 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -Ofast -o mt19937ar mt19937ar.c
mt19937ar took 430804 clocks

current gcc-8 snapshot + extra opts to improve znver1 performance:

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mprefer-vector-width=none -mno-fma -Ofast matrix.c -o matrix
mult took 130329 clocks

/usr/local/gcc/bin/gcc -march=znver1 -mtune=znver1 -mno-avx2 -Ofast -o mt19937ar mt19937ar.c
mt19937ar took 387728 clocks

So gcc loses on mt19937ar.c without -mno-avx2, but gcc wins big on matrix.c, especially with -mprefer-vector-width=none -mno-fma.
That second llvm command line should read: /usr/local/llvm-5.0.1-rc2/bin/clang -march=znver1 -mtune=znver1 -Ofast mt19937ar.c -o mt19937ar
> So gcc loses on mt19937ar.c without -mno-avx2
> But gcc wins big on matrix.c, especially with -mprefer-vector-width=none
> -mno-fma

It is because llvm does not use vgather at all unless avx512 is present. I will look into the vgather cost model tomorrow.
Honza
Author: hubicka
Date: Thu Nov 30 09:36:36 2017
New Revision: 255268

URL: https://gcc.gnu.org/viewcvs?rev=255268&root=gcc&view=rev
Log:
	PR target/81616
	* x86-tune-costs.h (generic_cost): Revise for modern CPUs.
	* gcc.target/i386/l_fma_double_1.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_2.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_3.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_4.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_5.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_double_6.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_1.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_2.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_3.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_4.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_5.c: Update count of fma instructions.
	* gcc.target/i386/l_fma_float_6.c: Update count of fma instructions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_6.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_6.c
Author: hubicka
Date: Sat Dec 2 09:22:41 2017
New Revision: 255357

URL: https://gcc.gnu.org/viewcvs?rev=255357&root=gcc&view=rev
Log:
	PR target/81616
	* x86-tune.def: Remove obsolete FIXMEs.
	(X86_TUNE_PARTIAL_FLAG_REG_STALL): Disable for generic.
	(X86_TUNE_FUSE_CMP_AND_BRANCH_32, X86_TUNE_FUSE_CMP_AND_BRANCH_64,
	X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, X86_TUNE_FUSE_ALU_AND_BRANCH):
	Enable for generic.
	(X86_TUNE_PAD_RETURNS): Disable for generic.
	* gcc.target/i386/pad-1.c: Compile for amdfam10.
	* gcc.target/i386/align-limit.c: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/align-limit.c
    trunk/gcc/testsuite/gcc.target/i386/pad-1.c
Author: hubicka
Date: Mon Dec 4 23:59:11 2017
New Revision: 255395

URL: https://gcc.gnu.org/viewcvs?rev=255395&root=gcc&view=rev
Log:
	PR target/81616
	* athlon.md: Disable for generic.
	* haswell.md: Enable for generic.
	* i386.c (ix86_sched_init_global): Add core hooks for generic.
	* x86-tune-sched.c (ix86_issue_rate): Increase issue rate for
	generic to 4.
	(ix86_adjust_cost): Move generic to haswell path.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/athlon.md
    trunk/gcc/config/i386/haswell.md
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune-sched.c
Created attachment 42872 [details] Untested fix for harmful FMAs

(In reply to Jan Hubicka from comment #25)
> Hi, I agree that the matrix multiplication fma issue is
> important and hopefully it will be fixed for GCC 8. See
> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

I am testing the attached patch to address the FMA generation. I plan to submit it to the mailing list this week if everything goes fine, but I would be very grateful for any comments or additional testing/benchmarking.

The patch brings the run-time of the matrix.c testcase with native znver1 tuning down to the levels seen with generic tuning; without it I see 60% regressions at both -O2 and -O3. (Even with the patch, using -mprefer-vector-width=256 can still do quite a bit better, but at least the difference is now 20% and not 100%.)
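A minimal sketch of the kind of loop such a patch targets (my illustration, not an excerpt from matrix.c): a reduction whose FMAs form a serial dependence chain through the accumulator, so fusing the multiply into the add makes the loop run at FMA latency rather than throughput:

/* Each fma must wait for the previous one's result in 'sum', so the
   chain serializes; splitting back into separate mul + add lets the
   multiplies run ahead while only the adds stay on the critical path. */
double dot (const double *a, const double *b, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];   /* FMA candidate: sum = fma (a[i], b[i], sum) */
  return sum;
}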
I have tested it on SKX with SPEC2006INT and SPEC2017INT and don't see any regressions.
(In reply to Sebastian Peryt from comment #39)
> I have tested it on SKX with SPEC2006INT and SPEC2017INT and don't see any
> regressions.

I should have written that the patch only affects znver1 tuning by default, so if you try to see what the effects are on another platform or with some other tuning, you need to add --param avoid-fma-max-bits=128, or perhaps 256 if that is the preferred vector length with your tuning (or even 512 on the most modern Intel CPUs?), to the command line. It would be interesting to see what the effect of that is on modern Intel CPUs, both on SPEC and the matrix.c example.

Meanwhile, I have submitted the patch to the mailing list:
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01053.html
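A usage sketch for that experiment (the parameter name is the one given above; the -march values are just examples):

gcc -O3 -march=haswell --param avoid-fma-max-bits=256 matrix.c -o matrix
gcc -O3 -march=skylake-avx512 --param avoid-fma-max-bits=512 matrix.c -o matrix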
Author: hubicka
Date: Tue Jan 2 09:31:47 2018
New Revision: 256070

URL: https://gcc.gnu.org/viewcvs?rev=256070&root=gcc&view=rev
Log:
	PR target/81616
	* x86-tune-costs.h (generic_cost): Reduce cost of FDIV 20->17,
	cost of sqrt 20->14, DIVSS 18->13, DIVSD 32->17, SQRTSS 30->14
	and SQRTSD 58->18, cond_not_taken_branch_cost 2->1. Increase
	cond_taken_branch_cost 3->4.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
Author: hubicka
Date: Tue Jan 2 13:04:19 2018
New Revision: 256073

URL: https://gcc.gnu.org/viewcvs?rev=256073&root=gcc&view=rev
Log:
	PR target/81616
	* config/i386/x86-tune-costs.h: Increase integer load costs
	for generic 4->6.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
Author: hubicka
Date: Wed Jan 10 11:02:55 2018
New Revision: 256424

URL: https://gcc.gnu.org/viewcvs?rev=256424&root=gcc&view=rev
Log:
	PR target/81616
	* i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
	* i386.h (TARGET_USE_GATHER): Define.
	* x86-tune.def (X86_TUNE_USE_GATHER): New.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/x86-tune.def
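If I read the commit right, the new knob should also be reachable from the command line through GCC's -mtune-ctrl debugging option (the feature name below is my assumption from the X86_TUNE_USE_GATHER identifier; -mtune-ctrl is a developer option, not meant for production use):

gcc -O3 -march=znver1 -mtune-ctrl=^use_gather mt19937ar.c -o mt19937ar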
Author: jamborm
Date: Fri Jan 12 14:06:10 2018
New Revision: 256581

URL: https://gcc.gnu.org/viewcvs?rev=256581&root=gcc&view=rev
Log:
	Deferring FMA transformations in tight loops

	2018-01-12  Martin Jambor  <mjambor@suse.cz>

	PR target/81616
	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
	* tree-ssa-math-opts.c: Include domwalk.h.
	(convert_mult_to_fma_1): New function.
	(fma_transformation_info): New type.
	(fma_deferring_state): Likewise.
	(cancel_fma_deferring): New function.
	(result_of_phi): Likewise.
	(last_fma_candidate_feeds_initial_phi): Likewise.
	(convert_mult_to_fma): Added deferring logic, split actual
	transformation to convert_mult_to_fma_1.
	(math_opts_dom_walker): New type.
	(math_opts_dom_walker::after_dom_children): New method, body moved
	here from pass_optimize_widening_mul::execute, added deferring logic
	bits.
	(pass_optimize_widening_mul::execute): Moved most of code to
	math_opts_dom_walker::after_dom_children.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
	* config/i386/i386.c (ix86_option_override_internal): Added maybe
	setting of PARAM_AVOID_FMA_MAX_BITS.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/params.def
    trunk/gcc/tree-ssa-math-opts.c
I believe all issues tracked here have been addressed. Andrew, do you still see some anomalies?
Honza
With the latest snapshot: gcc version 8.0.1 20180121

For mt19937ar, things now look reasonable without any strange options on Ryzen.

Top 5:
mt19937ar took 226849 clocks -march=amdfam10 -mtune=btver2
mt19937ar took 228970 clocks -march=amdfam10 -mtune=barcelona
mt19937ar took 229494 clocks -march=bdver1 -mtune=btver1
mt19937ar took 229524 clocks -march=nano -mtune=nano
mt19937ar took 230003 clocks -march=opteron-sse3 -mtune=athlon64-sse3

mt19937ar took 233793 clocks -march=k8-sse3 -mtune=x86-64
mt19937ar took 241700 clocks -march=corei7 -mtune=generic
mt19937ar took 242373 clocks -march=nano-3000 -mtune=znver1
mt19937ar took 245550 clocks -march=k8-sse3 -mtune=haswell
mt19937ar took 251431 clocks -march=znver1 -mtune=generic
mt19937ar took 262200 clocks -march=znver1 -mtune=znver1
mt19937ar took 276993 clocks -march=haswell -mtune=haswell

Bot 5:
mt19937ar took 341326 clocks -march=nano-x4 -mtune=silvermont
mt19937ar took 341750 clocks -march=core-avx-i -mtune=nocona
mt19937ar took 342457 clocks -march=k8 -mtune=znver1
mt19937ar took 347453 clocks -march=ivybridge -mtune=bonnell
mt19937ar took 364041 clocks -march=haswell -mtune=core-avx-i

with -mno-avx2:
mt19937ar took 235997 clocks -march=znver1 -mtune=opteron
mt19937ar took 233921 clocks -march=nano-1000 -mtune=x86-64
mt19937ar took 243452 clocks -march=znver1 -mtune=x86-64
mt19937ar took 243540 clocks -march=silvermont -mtune=generic
mt19937ar took 247113 clocks -march=znver1 -mtune=generic
mt19937ar took 241368 clocks -march=nano-2000 -mtune=haswell
mt19937ar took 247806 clocks -march=znver1 -mtune=znver1

Compare this with it taking 430875 clocks originally for -march=znver1 -mtune=znver1.

On Haswell:

Top 5:
mt19937ar took 220000 clocks -march=amdfam10 -mtune=amdfam10
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon64
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon64-sse3
mt19937ar took 220000 clocks -march=amdfam10 -mtune=athlon-fx
mt19937ar took 220000 clocks -march=amdfam10 -mtune=barcelona

mt19937ar took 220000 clocks -march=corei7-avx -mtune=x86-64
mt19937ar took 230000 clocks -march=haswell -mtune=haswell
mt19937ar took 240000 clocks -march=haswell -mtune=generic
mt19937ar took 260000 clocks -march=haswell -mtune=x86-64

Bot 5 (all various shades of mtune=bdverX or mtune=btverX):
mt19937ar took 310000 clocks -march=core-avx2 -mtune=bdver1
mt19937ar took 310000 clocks -march=haswell -mtune=bdver1
mt19937ar took 310000 clocks -march=skylake -mtune=bdver1
Again with the latest snapshot: gcc version 8.0.1 20180121

matrix.c still needs additional options to get the best out of the Ryzen processor, but it is better than before (223029 clocks vs 371978 originally); 122677 is achievable with the right options. However, the same can also be said for Haswell as things stand. The Haswell (-march=haswell -mtune=haswell) time has dropped from 190000 to 23000, but do we put that down to Meltdown/Spectre updates or compiler updates?

With just -O3 on Ryzen:

Top 5:
mult took 115669 clocks -march=ivybridge -mtune=skylake-avx512
mult took 118403 clocks -march=corei7-avx -mtune=skylake-avx512
mult took 119379 clocks -march=core-avx-i -mtune=skylake-avx512
mult took 119735 clocks -march=corei7-avx -mtune=skylake
mult took 119901 clocks -march=sandybridge -mtune=broadwell

mult took 120023 clocks -march=sandybridge -mtune=haswell
mult took 121010 clocks -march=corei7-avx -mtune=haswell
mult took 127371 clocks -march=sandybridge -mtune=x86-64
mult took 151208 clocks -march=btver2 -mtune=generic
mult took 152360 clocks -march=ivybridge -mtune=generic
mult took 173926 clocks -march=haswell -mtune=haswell
mult took 177359 clocks -march=znver1 -mtune=athlon64
mult took 180000 clocks -march=ivybridge -mtune=znver1
mult took 188219 clocks -march=znver1 -mtune=generic
mult took 199721 clocks -march=znver1 -mtune=x86-64
mult took 223029 clocks -march=znver1 -mtune=znver1

Bot 5:
mult took 377398 clocks -march=znver1 -mtune=bdver3
mult took 377650 clocks -march=knl -mtune=bdver3
mult took 378600 clocks -march=core-avx2 -mtune=bonnell
mult took 381447 clocks -march=skylake-avx512 -mtune=haswell
mult took 388837 clocks -march=skylake-avx512 -mtune=bdver4

On Haswell:

Top 5:
mult took 133704 clocks -march=ivybridge -mtune=k8-sse3
mult took 150000 clocks -march=btver2 -mtune=k8
mult took 150000 clocks -march=core-avx-i -mtune=x86-64
mult took 150000 clocks -march=corei7-avx -mtune=nano
mult took 150000 clocks -march=corei7-avx -mtune=opteron

mult took 160000 clocks -march=core-avx-i -mtune=haswell
mult took 190000 clocks -march=haswell -mtune=eden-x4
mult took 190000 clocks -march=ivybridge -mtune=generic
mult took 200000 clocks -march=haswell -mtune=x86-64
mult took 230000 clocks -march=haswell -mtune=haswell
mult took 270000 clocks -march=haswell -mtune=generic

Bot 5:
mult took 420000 clocks -march=skylake-avx512 -mtune=bdver2
mult took 420000 clocks -march=znver1 -mtune=bdver3
mult took 420000 clocks -march=znver1 -mtune=bdver4
mult took 430000 clocks -march=bdver2 -mtune=bdver2
mult took 430000 clocks -march=knl -mtune=bdver2

Using -mprefer-vector-width=none -mno-fma -mno-avx2 -O3:

On Ryzen:

Top 5:
mult took 116558 clocks -march=haswell -mtune=bdver3
mult took 116673 clocks -march=haswell -mtune=skylake
mult took 117268 clocks -march=sandybridge -mtune=skylake-avx512
mult took 117288 clocks -march=broadwell -mtune=nocona
mult took 118450 clocks -march=corei7-avx -mtune=haswell

mult took 119719 clocks -march=core-avx-i -mtune=znver1
mult took 120028 clocks -march=znver1 -mtune=skylake
mult took 122677 clocks -march=znver1 -mtune=znver1
mult took 123423 clocks -march=haswell -mtune=haswell
mult took 127388 clocks -march=skylake -mtune=x86-64
mult took 130475 clocks -march=znver1 -mtune=x86-64
mult took 132374 clocks -march=sandybridge -mtune=generic
mult took 162317 clocks -march=znver1 -mtune=generic

Bot 5:
mult took 300000 clocks -march=nano-x2 -mtune=btver2
mult took 310000 clocks -march=skylake-avx512 -mtune=westmere
mult took 319772 clocks -march=knl -mtune=sandybridge
mult took 320000 clocks -march=eden-x2 -mtune=amdfam10
mult took 330000 clocks -march=atom -mtune=broadwell

On Haswell:

Top 5:
mult took 123148 clocks -march=bonnell -mtune=ivybridge
mult took 130262 clocks -march=ivybridge -mtune=silvermont
mult took 135299 clocks -march=core-avx2 -mtune=nano-3000
mult took 150000 clocks -march=core-avx2 -mtune=intel
mult took 150000 clocks -march=haswell -mtune=btver1

mult took 170000 clocks -march=core-avx-i -mtune=haswell
mult took 170000 clocks -march=znver1 -mtune=x86-64
mult took 180000 clocks -march=haswell -mtune=haswell
mult took 180000 clocks -march=znver1 -mtune=generic
mult took 210000 clocks -march=haswell -mtune=generic
mult took 230000 clocks -march=haswell -mtune=x86-64

Bot 5:
mult took 350000 clocks -march=nano-x4 -mtune=nano-2000
mult took 350000 clocks -march=slm -mtune=skylake-avx512
mult took 360000 clocks -march=barcelona -mtune=broadwell
mult took 360000 clocks -march=nano -mtune=corei7
mult took 360000 clocks -march=nocona -mtune=btver2
Correction, that should be 230000 not 23000 for the haswell drop in performance.
> matrix.c is still needing additional options to get the best out of the Ryzen
> processor. But is better than before (223029 clocks vs 371978 originally),
> but 122677 is achievable with the right options. However the same can also be

Aha, for Ryzen we would still benefit from 256bit vectorization. It is not a win overall, and it will need bigger surgery to the vectorizer to implement properly, so that will unfortunately wait for the next stage1. This is the gap between -march=znver1 -mtune=generic and -march=znver1, so about 17%.

Concerning your options -mprefer-vector-width=none -mno-fma -mno-avx2 -O3: with Martin's patch in, -mno-fma should no longer have an effect here. Not sure why -mno-avx2 would be a win either; we originally introduced it to disable scatter/gather in the other benchmark, but that one is solved too. Do those two options still improve the scores for you?

It is also a mystery to me why -march=ivybridge would benefit anything, as the ISA is more or less a superset of znver. I will check more.
Honza
With the matrix.c benchmark on Ryzen, looking at the other options when using -march=znver1 and -mtune=znver1:

mult took 225281 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128
mult took 185961 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256
mult took 187577 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512

Adding -mno-avx2 has no effect on the above baseline.

Adding in -mno-fma:

mult took 223302 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128 -mno-fma
mult took 123773 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256 -mno-fma
mult took 124690 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512 -mno-fma

Is the patch in trunk yet? I was assuming it was, from the other comments.

Using -march=ivybridge but keeping the rest of the options:

mult took 215052 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=128 -mno-fma
mult took 121661 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=256 -mno-fma
mult took 131763 clocks -march=ivybridge -mtune=znver1 -mprefer-vector-width=512 -mno-fma

Switching to -march=ivybridge -mtune=skylake-avx512 and dropping the other options (still on Ryzen):

mult took 119195 clocks -march=ivybridge -mtune=skylake-avx512

With -march=znver1 -mtune=skylake-avx512 and dropping the other options:

mult took 182799 clocks -march=znver1 -mtune=skylake-avx512

So the combination of -march=ivybridge -mtune=skylake-avx512 is doing something right.
(In reply to Andrew Roberts from comment #50)
> With the matrix.c benchmark on Ryzen, looking at the other options when
> using -march=znver1 and -mtune=znver1:
>
> mult took 225281 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128
> mult took 185961 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256
> mult took 187577 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512
>
> Adding -mno-avx2 has no effect on the above baseline.
>
> Adding in -mno-fma:
>
> mult took 223302 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128 -mno-fma
> mult took 123773 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256 -mno-fma
> mult took 124690 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512 -mno-fma
>
> Is the patch in trunk yet? I was assuming it was, from the other comments.

Yes, but by default (on Zen) it only prevents generating FMAs for 128bit operands (or smaller). Originally, AMD kept 256bit ones or larger intact in their splitting patch (and in a conversation they hinted that they might be beneficial in some scenarios), and I kept that condition because 256bit vectors are not well understood and I had little time. We will definitely look at this when examining AVX256 on Zen. I am not sure whether we want to lift the restriction in stage 4 based only on matrix.c, but I would not oppose it.
Fixed? Or shall we treat it as a recurring bug?
I'd vote for marking this fixed (and asking anyone with other ideas what could be improved in generic tuning to open a new bug).
Yep, I think we can declare this fixed. The cost tuning seems to work reasonably well for Cores and Zens.
The master branch has been updated by Hongyu Wang <hongyuw@gcc.gnu.org>:

https://gcc.gnu.org/g:3a1a141f79c83ad38f7db3a21d8a4dcfe625c176

commit r13-4534-g3a1a141f79c83ad38f7db3a21d8a4dcfe625c176
Author: Hongyu Wang <hongyu.wang@intel.com>
Date:   Tue Dec 6 09:53:35 2022 +0800

    i386: Avoid fma_chain for -march=alderlake and sapphirerapids.

    For Alderlake there is a similar issue to PR 81616; enabling
    avoid_fma256_chain will also benefit Intel's latest platforms
    Alderlake and Sapphire Rapids.

    gcc/ChangeLog:

        * config/i386/x86-tune.def (X86_TUNE_AVOID_256FMA_CHAINS): Add
        m_SAPPHIRERAPIDS, m_ALDERLAKE and m_CORE_ATOM.