This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Enabling vectorization at -O2 for x86 generic, core and zen tuning

From: Jan Hubicka <hubicka at ucw dot cz>
To: gcc at gcc dot gnu dot org, rguenther at suse dot de, jakub at redhat dot com, mliska at suse dot cz, ubizjak at gmail dot com
Date: Sun, 6 Jan 2019 16:41:41 +0100
Subject: Enabling vectorization at -O2 for x86 generic, core and zen tuning
Hello,
while running benchmarks for inliner tuning I also ran benchmarks
comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
Liska's LNT setup (https://lnt.opensuse.org/).  The results are
summarized below, but you can also see the colorful tables produced
by Martin's LNT magic:

https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3&min_percentage_change=0.02&revisions=746f%2C55f&fbclid=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10&min_percentage_change=0.02&revisions=746f%2C55f
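
For readers who want to reproduce the comparison on a small scale, here
is a minimal sketch of the kind of loop the loop vectorizer targets.  The
function name saxpy and the file name are illustrative assumptions, not
taken from the benchmarks above; whether GCC actually vectorizes it
depends on the version, tuning and cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, not part of any benchmark in this mail.
   Iterations are independent and restrict promises that x and y do
   not overlap, so the loop is a candidate for -ftree-vectorize.

   Possible comparison (file name is an assumption):
     gcc -O2 -c saxpy.c
     gcc -O2 -ftree-vectorize -ftree-slp-vectorize -fopt-info-vec -c saxpy.c
   -fopt-info-vec asks GCC to report which loops it vectorized.  */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Comparing the object size and the generated assembly of the two builds
gives a rough feel for the speed/size tradeoff discussed below.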

Overall we got the following SPECrate improvements:

 SPECfp2k6   kabylake generic  +7.15%
 SPECfp2k6   kabylake native   +9.36%
 SPECfp2k17  kabylake generic  +5.36%
 SPECfp2k17  kabylake native   +6.03%
 SPECint2k17 kabylake generic  +4.13%

 SPECfp2k6   zen      generic  +9.98%
 SPECfp2k6   zen      native   +7.04%
 SPECfp2k17  zen      generic  +6.11%
 SPECfp2k17  zen      native   +5.46%
 SPECint2k17 zen      generic  +3.61%
 SPECint2k17 zen      native   +5.18%

The performance results are surprisingly strongly in favor of
vectorization.  Martin's setup also checks code size, which goes up
by as much as 26% on leslie3d, but since many of the benchmarks are small,
this is not very representative of the overall code size/compile time costs
of vectorization.

I measured compile time and code size on the larger programs I have
available.  There are notable changes on DealII, but otherwise the
increases are below 1%.  I also benchmarked Firefox, but there are no
significant differences because the build system already uses -O3 in
the places where it matters (graphics library etc.)

                   Compile time    Code segment size
Firefox mainline   in noise        0.8%
gcc from spec2k6   0.5%            0.6%
gdb                0.8%            0.3%
crafty             0%              0%
DealII             3.2%            4%

Note that I benchmarked -ftree-slp-vectorize separately before and the
results were hit and miss, so perhaps enabling only -ftree-vectorize would
give a better compile time tradeoff. I was worried about partial memory
stalls, but I will benchmark that and also benchmark the difference between
cost models.
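
To make the distinction between the two flags concrete, here is a
minimal sketch of straight-line code that is a candidate for SLP
vectorization rather than loop vectorization.  The type vec4 and the
function vec4_add are hypothetical names chosen for illustration;
whether GCC combines the four additions depends on the version, tuning
and the cost model selected with -fvect-cost-model (for example
-fvect-cost-model=cheap or -fvect-cost-model=dynamic).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-component vector type, used only for illustration.  */
struct vec4 { float v[4]; };

/* Four independent operations on adjacent elements and no loop: this
   is the pattern -ftree-slp-vectorize looks for.  The loop vectorizer
   (-ftree-vectorize) has nothing to work on here.  */
void vec4_add(struct vec4 *restrict out,
              const struct vec4 *restrict a,
              const struct vec4 *restrict b)
{
    assert(out != NULL && a != NULL && b != NULL);
    out->v[0] = a->v[0] + b->v[0];
    out->v[1] = a->v[1] + b->v[1];
    out->v[2] = a->v[2] + b->v[2];
    out->v[3] = a->v[3] + b->v[3];
}
```

Building this once with -O2 -ftree-vectorize and once with
-O2 -ftree-slp-vectorize is one way to see which flag is responsible
for a given transformation.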

There are some performance regressions, most notably in SPEC
 - exchange (all settings),
 - gamess (all settings),
 - calculix (Zen native only),
 - bwaves (Zen native only)
and, from Polyhedron, induct2 on all settings and tfft2 on Zen only. Botan
seems very noisy, but it is rather special code.

Exchange can be fixed by adding a heuristic saying that it is a bad idea to
vectorize within a 10-deep loop nest containing a recursive call. I believe
gamess and calculix are understood, and I can look into the remaining
cases.
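
As a rough illustration of the code shape in question, here is a
hypothetical C analogue (exchange2 itself is Fortran, and this is not
its source).  The names N and count_fills are assumptions made for the
sketch.  The inner loop runs only a few iterations per call and the
surrounding loop recurses, so vector setup cost is paid many times for
very little work.

```c
#include <assert.h>

#define N 4  /* hypothetical tiny problem size */

/* Counts the ways to fill N slots with the digits 0..N-1 without
   repeating a digit.  place[0..depth-1] holds the digits chosen so
   far.  The inner loop has at most N-1 iterations and sits in a loop
   that contains a recursive call.  */
int count_fills(int depth, int place[N])
{
    assert(depth >= 0 && depth <= N);
    if (depth == N)
        return 1;

    int total = 0;
    for (int d = 0; d < N; d++) {
        int clash = 0;
        for (int r = 0; r < depth; r++)   /* short inner loop */
            clash |= (place[r] == d);
        if (clash)
            continue;
        place[depth] = d;
        total += count_fills(depth + 1, place);  /* recursion in the nest */
    }
    return total;
}
```

Calling count_fills(0, place) with a zeroed array enumerates all
orderings of the N digits.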

Overall I am surprised how many improvements vectorization at -O2 brings;
clearly the more parallel CPUs depend on it.  In my experience
from analyzing regressions of gcc -O2 compared to clang -O2 builds,
vectorization is one of the most common reasons. Having gcc -O2 produce
lower SPEC scores and comparably large binaries to clang -O2 does not
feel OK, and I think the problem is not limited to artificial
benchmarks.

Even though it is late in the release cycle, I wonder if we can do that for
GCC 9?  Performance of vectorization is very architecture specific, so I
would propose enabling vectorization for Zen, Core-based chips and
generic in x86-64. I can also run benchmarks on Bulldozer. I can then
tune down the cheap model to avoid some of the more expensive
transformations.

Honza


Kabylake Spec2k6, generic tuning

  improvements:
    SPEC2006/FP/481.wrf 		-31.33% 	
    SPEC2006/FP/436.cactusADM 		-28.17% 	
    SPEC2006/FP/437.leslie3d 		-17.21% 	
    SPEC2006/FP/434.zeusmp 		-12.90% 	
    SPEC2006/FP/454.calculix 		-6.44% 	
    SPEC2006/FP/433.milc 		-6.03% 	
    SPEC2006/FP/459.GemsFDTD 		-4.65% 	
    SPEC2006/FP/450.soplex 		-2.11% 	
    SPEC2006/INT/403.gcc 		-6.54% 	
    SPEC2006/INT/456.hmmer 		-5.45% 	
    SPEC2006/INT/464.h264ref 		-2.23% 	
  regressions:
    SPEC2006/FP/416.gamess 		8.51% 	
    SPEC2006/FP/447.dealII 		2.73% 	

Kabylake spec2k6 -march=native

  improvements:
    SPEC2006/FP/436.cactusADM 	 	-45.52% 	
    SPEC2006/FP/481.wrf 	 	-34.13% 	
    SPEC2006/FP/434.zeusmp 	 	-20.25% 	
    SPEC2006/FP/437.leslie3d 	 	-19.44% 	
    SPEC2006/FP/459.GemsFDTD 	 	-6.85% 	
    SPEC2006/FP/433.milc 	 	-2.15% 	
    SPEC2006/INT/456.hmmer 	 	-8.97% 	
    SPEC2006/INT/403.gcc 	 	-7.07% 	
    SPEC2006/INT/464.h264ref 	 	-3.00% 	
  regressions:
    SPEC2006/FP/416.gamess 	 	7.97% 	
    SPEC2006/INT/483.xalancbmk	 	3.55% 	
    SPEC2006/INT/400.perlbench	 	2.61% 	

Kabylake spec2k17 generic tuning

  improvements:
    SPEC2017/INT/525.x264_r 	 	-33.24% 	
    SPEC2017/FP/521.wrf_r 	 	-30.63% 	
    SPEC2017/FP/538.imagick_r 	 	-9.16% 	
    SPEC2017/FP/554.roms_r 	 	-6.29% 	
    SPEC2017/INT/523.xalancbmk	 	-5.69% 	
    SPEC2017/FP/527.cam4_r 	 	-5.19% 	
    SPEC2017/INT/557.xz_r 	 	-4.58% 	
    SPEC2017/FP/510.parest_r 	 	-4.28% 	
    SPEC2017/FP/549.fotonik3d	 	-2.62% 	
  regressions:
    SPEC2017/INT/548.exchange2	 	12.54% 	

Kabylake spec2k17 -march=native:

  improvements:
    SPEC2017/FP/521.wrf_r 	 	-37.25% 	
    SPEC2017/INT/525.x264_r 	 	-30.31% 	
    SPEC2017/FP/554.roms_r 	 	-10.43% 	
    SPEC2017/FP/527.cam4_r 	 	-10.05% 	
    SPEC2017/FP/549.fotonik3d	 	-7.82% 	
    SPEC2017/FP/510.parest_r 	 	-4.48% 	
  regressions:
    SPEC2017/INT/548.exchange2	 	14.51% 	
    SPEC2017/INT/557.xz_r 	 	3.17% 	
    SPEC2017/FP/519.lbm_r 	 	2.22% 	

Zen spec2k6 generic tuning

  improvements:
    SPEC2006/FP/436.cactusADM 		-39.94% 	
    SPEC2006/FP/481.wrf 		-33.44% 	
    SPEC2006/FP/437.leslie3d 		-16.35% 	
    SPEC2006/FP/434.zeusmp 		-15.83% 	
    SPEC2006/FP/433.milc 		-13.53% 	
    SPEC2006/FP/454.calculix 		-9.18% 	
    SPEC2006/INT/456.hmmer 		-8.22% 	
    SPEC2006/FP/459.GemsFDTD 		-7.53% 	
    SPEC2006/FP/447.dealII 		-6.12% 	
    SPEC2006/INT/403.gcc 		-3.67% 	
    SPEC2006/INT/464.h264ref 		-2.92% 	
    SPEC2006/INT/401.bzip2 		-2.07% 	
  regressions:
    SPEC2006/FP/416.gamess 		8.06% 	
    SPEC2006/INT/400.perlbench		6.52% 	
    SPEC2006/INT/483.xalancbmk		3.84% 	

Zen SPEC2k6 -march=native

  improvements:
    SPEC2006/FP/481.wrf 		-31.55% 	
    SPEC2006/FP/436.cactusADM 		-29.20% 	
    SPEC2006/FP/437.leslie3d 		-16.91% 	
    SPEC2006/FP/433.milc 		-14.39% 	
    SPEC2006/FP/434.zeusmp 		-10.18% 	
    SPEC2006/INT/456.hmmer 		-8.95% 	
    SPEC2006/FP/459.GemsFDTD 		-7.23% 	
    SPEC2006/FP/447.dealII 		-3.31% 	
    SPEC2006/INT/464.h264ref 		-3.29% 	
    SPEC2006/FP/470.lbm 		-2.83% 	
    SPEC2006/INT/403.gcc 		-2.56% 	
  regressions:
    SPEC2006/FP/416.gamess 		8.45% 	
    SPEC2006/FP/454.calculix 		10.07% 	

Zen SPEC2k17 generic tuning
  improvements:
    SPEC2017/INT/525.x264_r 		-34.06% 	
    SPEC2017/FP/521.wrf_r 		-29.71% 	
    SPEC2017/FP/538.imagick_r 		-7.01% 	
    SPEC2017/FP/549.fotonik3d 		-6.00% 	
    SPEC2017/FP/527.cam4_r 		-5.95% 	
    SPEC2017/FP/510.parest_r 		-5.93% 	
    SPEC2017/FP/554.roms_r 		-5.42% 	
    SPEC2017/FP/503.bwaves_r 		-4.46% 	
    SPEC2017/FP/511.povray_r 		-3.76% 	
    SPEC2017/INT/523.xalancbmk		-3.10% 	
    SPEC2017/FP/507.cactuBSSN 		-2.22% 	
  regressions:
    SPEC2017/INT/548.exchange2 		8.41% 	
    SPEC2017/INT/505.mcf_r 		2.05% 	

Zen SPEC2k17 -march=native
  improvements:
    SPEC2017/INT/525.x264_r 		-37.00% 	
    SPEC2017/FP/521.wrf_r 		-28.70% 	
    SPEC2017/FP/538.imagick_r 		-17.91% 	
    SPEC2017/FP/510.parest_r 		-7.25% 	
    SPEC2017/FP/527.cam4_r 		-5.52% 	
    SPEC2017/FP/554.roms_r 		-5.10% 	
    SPEC2017/INT/523.xalancbmk 		-3.82% 	
    SPEC2017/FP/549.fotonik3d 		-2.52% 	
    SPEC2017/FP/507.cactuBSSN 		-2.16% 	
    SPEC2017/INT/502.gcc 		-2.12% 	
  regressions:
    SPEC2017/INT/548.exchange2 		9.80% 	
    SPEC2017/FP/503.bwaves_r 		7.81% 	
    SPEC2017/INT/531.deepsjeng 		2.16% 	


Kabylake Polyhedron generic

  improvements:
    tfft2 	-23.05% 	
    test_fpu2 	-18.89% 	
    gas_dyn2 	-13.55% 	
    linpk 	-7.77% 	
    rnflow 	-2.52% 	
    nf 		-2.24% 	
  regressions:
    air 	3.76% 
    induct2 	216.41%

Zen Polyhedron generic

  improvements:
    gas_dyn2 	 	-36.10% 	
    test_fpu2 	 	-20.97% 	
    linpk 		-6.29% 	
    channel2 	 	-5.04% 	
    fatigue2 	 	-3.43% 	
    nf 			-3.07% 	
    capacita 	 	-2.30% 	
  regressions:
    induct2 	 	231.04% 	
    tfft2 	 	34.25% 	
    protein 	 	4.81% 	

Kabylake C++ benchmarks generic

  improvements:
    nbench/NEURAL NET 			34.01% 	
    botan/CMAC(AES-128) mac 		21.62% 	
    botan/AES-128/CBC/PKCS7 enc		21.25% 	
    botan/AES-128/CBC/PKCS7 dec	 	18.43% 	
    nbench/LU DECOMPOSITION 	 	13.42% 	
    botan/AES-128/EAX encrypt 		10.93% 	
    botan/AES-128/EAX decrypt 		10.50% 	
    botan/AES-128/OCB encrypt 		9.84% 	
    botan/AES-128/OCB decrypt 		9.29% 	
    nbench/ASSIGNMENT 			6.15% 	
    botan/AES-128/XTS decrypt 	 	3.74% 	
    botan/AES-128/XTS encrypt 	 	3.64% 	
    botan/CTR-BE(AES-128) encr 	 	2.61% 	
    botan/CTR-BE(AES-128) decr 	 	2.56% 	
    botan/AES-128/GCM(16) enct 	 	2.52% 	
    botan/AES-128/GCM(16) decr	 	2.01% 	
  regressions:
    botan/Whirlpool hash 		-11.35% 	
    nbench/HUFFMAN    			-2.31% 	
    botan/Keccak-1600(512) hash		-3.61% 	
    botan/Tiger(24,3) hash 		-2.94% 	

Zen C++ benchmarks generic

  improvements:
    nbench/NEURAL NET 		       47.78% 	
    botan/AES-128/CBC/PKCS7 encr       21.07% 	
    botan/CMAC(AES-128) mac 	       19.97% 	
    botan/CTR-BE(AES-128) encr 		15.21% 	
    botan/CTR-BE(AES-128) decr 		14.24% 	
    botan/AES-128/EAX encrypt 	       13.46% 	
    botan/AES-128/EAX decrypt 	       12.84% 	
    nbench/LU DECOMPOSITION 		9.12% 	
    botan/AES-128/GCM(16) encr 		5.66% 	
    botan/AES-128/GCM(16) decr 		4.40% 	
    botan/AES-128/CBC/PKCS7 decr	2.96% 	
    botan/ChaCha20Poly1305 decr	       2.67% 	
    botan/AES-128/XTS encrypt 		2.53% 	
    botan/Salsa20 encrypt 	       2.33% 	
    botan/Skein-512(512) hash 	       2.22% 	
    botan/ChaCha20Poly1305 encr	       2.14% 	
  regressions:
    nbench/HUFFMAN 			-12.51% 	
    botan/Whirlpool hash 	       -8.26% 	
    botan/Camellia-192 encrypt 	       -7.12% 	
    botan/Camellia-256 decrypt 	       -7.07% 	
    botan/Camellia-192 decrypt 	       -6.82% 	
    botan/Camellia-128 decrypt 	       -6.73% 	
    botan/Camellia-256 encrypt 	       -6.59% 	
    botan/AES-128/XTS decrypt 		-6.31% 	
    botan/Camellia-128 encrypt 	       -6.30% 	
    botan/XTEA decrypt 		       -4.87% 	
    nbench/ASSIGNMENT 		       -4.85% 	
    botan/AES-128/OCB encrypt 	       -3.36% 	
    botan/Keccak-1600(512) hash        -3.08% 	
    botan/AES-128 decrypt 		-2.52% 	
    botan/SHA-160 hash 			-2.31% 	

Binary sizes and other stats are in the aforementioned links.
Follow-Ups:
- Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning
  - From: Richard Biener
- Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning
  - From: Eric Botcazou