This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Enabling vectorization at -O2 for x86 generic, core and zen tuning
- From: Jan Hubicka <hubicka at ucw dot cz>
- To: gcc at gcc dot gnu dot org, rguenther at suse dot de, jakub at redhat dot com, mliska at suse dot cz, ubizjak at gmail dot com
- Date: Sun, 6 Jan 2019 16:41:41 +0100
- Subject: Enabling vectorization at -O2 for x86 generic, core and zen tuning
Hello,
while running benchmarks for inliner tuning I also ran benchmarks
comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
Liska's LNT setup (https://lnt.opensuse.org/). The results are
summarized below, but you can also see the colorful tables produced
by Martin's LNT magic:
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3&min_percentage_change=0.02&revisions=746f%2C55f&fbclid=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10&min_percentage_change=0.02&revisions=746f%2C55f
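To make the two configurations concrete, here is a minimal sketch of the kind of loop the loop vectorizer targets. The function name and the kernel are hypothetical and are not taken from any of the benchmarks.

```c
#include <stddef.h>

/* Hypothetical kernel, for illustration only: a simple element-wise loop.
   The restrict qualifiers promise that x and y do not overlap, which
   spares the compiler a run-time alias check before vectorizing.  */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiling this once with `gcc -O2 -S` and once with `gcc -O2 -ftree-vectorize -ftree-slp-vectorize -S` and comparing the assembly should show the difference; adding `-fopt-info-vec` makes GCC report which loops it vectorized. The exact output depends on the GCC version and the tuning in use.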
Overall we got the following SPECrate improvements:
SPECfp2k6    kabylake  generic  +7.15%
SPECfp2k6    kabylake  native   +9.36%
SPECfp2k17   kabylake  generic  +5.36%
SPECfp2k17   kabylake  native   +6.03%
SPECint2k17  kabylake  generic  +4.13%
SPECfp2k6    zen       generic  +9.98%
SPECfp2k6    zen       native   +7.04%
SPECfp2k17   zen       generic  +6.11%
SPECfp2k17   zen       native   +5.46%
SPECint2k17  zen       generic  +3.61%
SPECint2k17  zen       native   +5.18%
The performance results are surprisingly strongly in favor of
vectorization. Martin's setup also checks code size, which goes up
by as much as 26% on leslie3d, but since many of the benchmarks are small,
this is not very representative of the overall code size/compile time costs
of vectorization.
I measured compile time/size on larger programs I have available with
notable changes on DealII, but otherwise sub 1% increases. I also
benchmarked Firefox but there are no significant differences because
build system already uses -O3 for places where it matters (graphics
library etc.)
                   Compile time   Code segment size
Firefox mainline   in noise       0.8%
gcc from spec2k6   0.5%           0.6%
gdb                0.8%           0.3%
crafty             0%             0%
DealII             3.2%           4%
Note that I benchmarked -ftree-slp-vectorize separately before and the
results were hit-and-miss, so perhaps enabling only -ftree-vectorize would
give better compile time tradeoffs. I was worried about partial memory
stalls, but I will benchmark it and also benchmark the difference between
cost models.
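As an illustration of what -ftree-slp-vectorize covers that -ftree-vectorize does not, here is a minimal sketch of straight-line code with no loop. The function is hypothetical and not taken from the benchmarks; whether GCC actually vectorizes it depends on the version, the target, and the cost model selected with -fvect-cost-model.

```c
/* Hypothetical example, for illustration only: four independent operations
   on adjacent elements, written without a loop.  The loop vectorizer has
   nothing to work on here; basic-block (SLP) vectorization may combine the
   four scalar additions into a single vector addition.  */
void add4(float *restrict out, const float *restrict a,
          const float *restrict b)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Compiling with `gcc -O2 -ftree-slp-vectorize -fopt-info-vec` should report whether the basic block was vectorized.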
There are some performance regressions, most notably in SPEC
- exchange (all settings),
- gamess (all settings),
- calculix (Zen native only),
- bwaves (Zen native)
and, from Polyhedron, induct2 on all settings and tfft2 on Zen only. Botan
seems very noisy, but it is rather special code.
Exchange can be fixed by adding a heuristic saying that it is a bad idea to
vectorize within a loop nest of depth 10 containing a recursive call. I believe
gamess and calculix are understood and I can look into the remaining
cases.
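The code shape in question can be sketched as follows. This is a hypothetical C reduction and not the exchange2 source (which is Fortran, with a much deeper nest): a recursive search whose body is a loop nest with a short inner loop, and with the recursive call inside the nest. My assumption is that vectorizing such short inner loops rarely pays off, because they run only a few iterations each time they are entered.

```c
#define N 4

/* Hypothetical search, for illustration only: counts the ways to choose
   one column per row so that no column is used twice.  */
static int search(int row, int cols[N])
{
    if (row == N)
        return 1;

    int count = 0;
    for (int c = 0; c < N; c++) {
        int clash = 0;
        /* Short inner loop (at most N - 1 iterations): a vectorization
           candidate with a very small trip count.  */
        for (int r = 0; r < row; r++)
            clash |= (cols[r] == c);
        if (!clash) {
            cols[row] = c;
            count += search(row + 1, cols);  /* recursive call in the nest */
        }
    }
    return count;
}

int count_placements(void)
{
    int cols[N] = {0};
    return search(0, cols);
}
```

With N set to 4, count_placements() counts the 4! = 24 permutations.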
Overall I am surprised how many improvements vectorization at -O2 can
bring; clearly, more parallel CPUs depend on it. In my experience
from analyzing regressions of gcc -O2 compared to clang -O2 builds,
vectorization is one of the most common reasons. Having gcc -O2 produce
lower SPEC scores than clang -O2 with comparably large binaries does not
feel OK and I think the problem is not limited just to artificial
benchmarks.
Even though it is late in the release cycle I wonder if we can do that for
GCC 9? Since performance of vectorization is very architecture specific, I
would propose enabling vectorization for Zen, Core-based chips and
generic tuning on x86-64. I can also run benchmarks on Bulldozer. I can then
tune down the cheap model to avoid some of the more expensive
transformations.
Honza
Kabylake Spec2k6, generic tuning
improvements:
SPEC2006/FP/481.wrf -31.33%
SPEC2006/FP/436.cactusADM -28.17%
SPEC2006/FP/437.leslie3d -17.21%
SPEC2006/FP/434.zeusmp -12.90%
SPEC2006/FP/454.calculix -6.44%
SPEC2006/FP/433.milc -6.03%
SPEC2006/FP/459.GemsFDTD -4.65%
SPEC2006/FP/450.soplex -2.11%
SPEC2006/INT/403.gcc -6.54%
SPEC2006/INT/456.hmmer -5.45%
SPEC2006/INT/464.h264ref -2.23%
regressions:
SPEC2006/FP/416.gamess 8.51%
SPEC2006/FP/447.dealII 2.73%
Kabylake spec2k6 -march=native
improvements:
SPEC2006/FP/436.cactusADM -45.52%
SPEC2006/FP/481.wrf -34.13%
SPEC2006/FP/434.zeusmp -20.25%
SPEC2006/FP/437.leslie3d -19.44%
SPEC2006/FP/459.GemsFDTD -6.85%
SPEC2006/FP/433.milc -2.15%
SPEC2006/INT/456.hmmer -8.97%
SPEC2006/INT/403.gcc -7.07%
SPEC2006/INT/464.h264ref -3.00%
regressions:
SPEC2006/FP/416.gamess 7.97%
SPEC2006/INT/483.xalancbmk 3.55%
SPEC2006/INT/400.perlbench 2.61%
Kabylake spec2k17 generic tuning
improvements:
SPEC2017/INT/525.x264_r -33.24%
SPEC2017/FP/521.wrf_r -30.63%
SPEC2017/FP/538.imagick_r -9.16%
SPEC2017/FP/554.roms_r -6.29%
SPEC2017/INT/523.xalancbmk -5.69%
SPEC2017/FP/527.cam4_r -5.19%
SPEC2017/INT/557.xz_r -4.58%
SPEC2017/FP/510.parest_r -4.28%
SPEC2017/FP/549.fotonik3d -2.62%
regressions:
SPEC2017/INT/548.exchange2 12.54%
Kabylake spec2k17 -march=native:
improvements:
SPEC2017/FP/521.wrf_r -37.25%
SPEC2017/INT/525.x264_r -30.31%
SPEC2017/FP/554.roms_r -10.43%
SPEC2017/FP/527.cam4_r -10.05%
SPEC2017/FP/549.fotonik3d -7.82%
SPEC2017/FP/510.parest_r -4.48%
regressions:
SPEC2017/INT/548.exchange2 14.51%
SPEC2017/INT/557.xz_r 3.17%
SPEC2017/FP/519.lbm_r 2.22%
Zen spec2k6 generic tuning
improvements:
SPEC2006/FP/436.cactusADM -39.94%
SPEC2006/FP/481.wrf -33.44%
SPEC2006/FP/437.leslie3d -16.35%
SPEC2006/FP/434.zeusmp -15.83%
SPEC2006/FP/433.milc -13.53%
SPEC2006/FP/454.calculix -9.18%
SPEC2006/INT/456.hmmer -8.22%
SPEC2006/FP/459.GemsFDTD -7.53%
SPEC2006/FP/447.dealII -6.12%
SPEC2006/INT/403.gcc -3.67%
SPEC2006/INT/464.h264ref -2.92%
SPEC2006/INT/401.bzip2 -2.07%
regressions:
SPEC2006/FP/416.gamess 8.06%
SPEC2006/INT/400.perlbench 6.52%
SPEC2006/INT/483.xalancbmk 3.84%
Zen SPEC2k6 -march=native
improvements:
SPEC2006/FP/481.wrf -31.55%
SPEC2006/FP/436.cactusADM -29.20%
SPEC2006/FP/437.leslie3d -16.91%
SPEC2006/FP/433.milc -14.39%
SPEC2006/FP/434.zeusmp -10.18%
SPEC2006/INT/456.hmmer -8.95%
SPEC2006/FP/459.GemsFDTD -7.23%
SPEC2006/FP/447.dealII -3.31%
SPEC2006/INT/464.h264ref -3.29%
SPEC2006/FP/470.lbm -2.83%
SPEC2006/INT/403.gcc -2.56%
regressions:
SPEC2006/FP/416.gamess 8.45%
SPEC2006/FP/454.calculix 10.07%
Zen SPEC2k17 generic tuning
improvements:
SPEC2017/INT/525.x264_r -34.06%
SPEC2017/FP/521.wrf_r -29.71%
SPEC2017/FP/538.imagick_r -7.01%
SPEC2017/FP/549.fotonik3d -6.00%
SPEC2017/FP/527.cam4_r -5.95%
SPEC2017/FP/510.parest_r -5.93%
SPEC2017/FP/554.roms_r -5.42%
SPEC2017/FP/503.bwaves_r -4.46%
SPEC2017/FP/511.povray_r -3.76%
SPEC2017/INT/523.xalancbmk -3.10%
SPEC2017/FP/507.cactuBSSN -2.22%
regressions:
SPEC2017/INT/548.exchange2 8.41%
SPEC2017/INT/505.mcf_r 2.05%
Zen SPEC2k17 -march=native
improvements:
SPEC2017/INT/525.x264_r -37.00%
SPEC2017/FP/521.wrf_r -28.70%
SPEC2017/FP/538.imagick_r -17.91%
SPEC2017/FP/510.parest_r -7.25%
SPEC2017/FP/527.cam4_r -5.52%
SPEC2017/FP/554.roms_r -5.10%
SPEC2017/INT/523.xalancbmk -3.82%
SPEC2017/FP/549.fotonik3d -2.52%
SPEC2017/FP/507.cactuBSSN -2.16%
SPEC2017/INT/502.gcc -2.12%
regressions:
SPEC2017/INT/548.exchange2 9.80%
SPEC2017/FP/503.bwaves_r 7.81%
SPEC2017/INT/531.deepsjeng 2.16%
Kabylake Polyhedron generic
improvements:
tfft2 -23.05%
test_fpu2 -18.89%
gas_dyn2 -13.55%
linpk -7.77%
rnflow -2.52%
nf -2.24%
regressions:
air 3.76%
induct2 216.41%
Zen Polyhedron generic
improvements:
gas_dyn2 -36.10%
test_fpu2 -20.97%
linpk -6.29%
channel2 -5.04%
fatigue2 -3.43%
nf -3.07%
capacita -2.30%
regressions:
induct2 231.04%
tfft2 34.25%
protein 4.81%
Kabylake C++ benchmarks generic
improvements:
nbench/NEURAL NET 34.01%
botan/CMAC(AES-128) mac 21.62%
botan/AES-128/CBC/PKCS7 enc 21.25%
botan/AES-128/CBC/PKCS7 dec 18.43%
nbench/LU DECOMPOSITION 13.42%
botan/AES-128/EAX encrypt 10.93%
botan/AES-128/EAX decrypt 10.50%
botan/AES-128/OCB encrypt 9.84%
botan/AES-128/OCB decrypt 9.29%
nbench/ASSIGNMENT 6.15%
botan/AES-128/XTS decrypt 3.74%
botan/AES-128/XTS encrypt 3.64%
botan/CTR-BE(AES-128) encr 2.61%
botan/CTR-BE(AES-128) decr 2.56%
botan/AES-128/GCM(16) encr 2.52%
botan/AES-128/GCM(16) decr 2.01%
regressions:
botan/Whirlpool hash -11.35%
nbench/HUFFMAN -2.31%
botan/Keccak-1600(512) hash -3.61%
botan/Tiger(24,3) hash -2.94%
Zen C++ benchmarks generic
improvements:
nbench/NEURAL NET 47.78%
botan/AES-128/CBC/PKCS7 encr 21.07%
botan/CMAC(AES-128) mac 19.97%
botan/CTR-BE(AES-128) encr 15.21%
botan/CTR-BE(AES-128) decr 14.24%
botan/AES-128/EAX encrypt 13.46%
botan/AES-128/EAX decrypt 12.84%
nbench/LU DECOMPOSITION 9.12%
botan/AES-128/GCM(16) encr 5.66%
botan/AES-128/GCM(16) decr 4.40%
botan/AES-128/CBC/PKCS7 decr 2.96%
botan/ChaCha20Poly1305 decr 2.67%
botan/AES-128/XTS encrypt 2.53%
botan/Salsa20 encrypt 2.33%
botan/Skein-512(512) hash 2.22%
botan/ChaCha20Poly1305 encr 2.14%
regressions:
nbench/HUFFMAN -12.51%
botan/Whirlpool hash -8.26%
botan/Camellia-192 encrypt -7.12%
botan/Camellia-256 decrypt -7.07%
botan/Camellia-192 decrypt -6.82%
botan/Camellia-128 decrypt -6.73%
botan/Camellia-256 encrypt -6.59%
botan/AES-128/XTS decrypt -6.31%
botan/Camellia-128 encrypt -6.30%
botan/XTEA decrypt -4.87%
nbench/ASSIGNMENT -4.85%
botan/AES-128/OCB encrypt -3.36%
botan/Keccak-1600(512) hash -3.08%
botan/AES-128 decrypt -2.52%
botan/SHA-160 hash -2.31%
Binary sizes and other stats are in the aforementioned links.
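A note on reading the SPEC and Polyhedron tables: improvements there carry a negative sign, which suggests the entries are relative changes in run time (an assumption; the metric is not stated in this message). Under that assumption, a percentage can be converted to a speedup factor with the small helper below, whose name is hypothetical.

```c
/* Convert a relative run-time change in percent (for example -31.33)
   into a speedup factor (old time / new time).
   Assumes the entry really is a run-time change and that pct > -100.  */
double speedup_from_time_change(double pct)
{
    return 1.0 / (1.0 + pct / 100.0);
}
```

For example, -31.33% for 481.wrf would correspond to a speedup of about 1.46x, while 216.41% for induct2 would mean the vectorized build runs at roughly 0.32x the speed of the baseline.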