Why vectorization didn't turn on by -O2

Kewen.Lin linkw@linux.ibm.com
Mon Aug 16 03:22:33 GMT 2021


On 2021/8/4 4:31 PM, Richard Biener wrote:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
> 
>> Hongtao Liu <crazylht@gmail.com> writes:
>>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>>> <gcc-help@gcc.gnu.org> wrote:
>>>>
>>>> Jan Hubicka <hubicka@ucw.cz> writes:
>>>>> Hi,
>>>>> here are updated scores.
>>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>> compares
>>>>>   base:  mainline
>>>>>   1st column: mainline with very cheap vectorization at -O2 and -O3
>>>>>   2nd column: mainline with cheap vectorization at -O2 and -O3.
>>>>>
>>>>> The short story is:
>>>>>
>>>>> 1) -O2 generic performance
>>>>>     kabylake (Intel):
>>>>>                               very    cheap
>>>>>       SPEC/SPEC2006/FP/total  ~       8.32%
>>>>>       SPEC/SPEC2006/total     -0.38%  4.74%
>>>>>       SPEC/SPEC2006/INT/total -0.91%  -0.14%
>>>>>
>>>>>       SPEC/SPEC2017/INT/total 4.71%   7.11%
>>>>>       SPEC/SPEC2017/total     2.22%   6.52%
>>>>>       SPEC/SPEC2017/FP/total  0.34%   6.06%
>>>>>     zen
>>>>>       SPEC/SPEC2006/FP/total  0.61%   10.23%
>>>>>       SPEC/SPEC2006/total     0.26%   6.27%
>>>>>       SPEC/SPEC2006/INT/total -0.24%  0.90%
>>>>>
>>>>>       SPEC/SPEC2017/INT/total 5.34%   7.80%
>>>>>       SPEC/SPEC2017/total     3.02%   6.55%
>>>>>       SPEC/SPEC2017/FP/total  1.26%   5.60%
>>>>>
>>>>>  2) -O2 size:
>>>>>      -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>>>>>      -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>>>>>  3) build times:
>>>>>      0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>>>>>      0.39% 0.57% 0.71%       (very cheap) 5.40% 6.23% 8.44%       (cheap) for spec2k2017
>>>>>     here I simply copied data from different configurations
>>>>>
>>>>> So for SPEC I would say that most of the compile time cost is derived
>>>>> from code size growth, which is a problem with the cheap model but not with
>>>>> very cheap.  Very cheap indeed results in code size improvements, and the
>>>>> compile time impact is probably somewhere around 0.5%.
>>>>>
>>>>> So from these scores alone this would seem that vectorization makes
>>>>> sense at -O2 with very cheap model to me (I am sure we have other
>>>>> optimizations with worse benefits to compile time tradeoffs).
>>>>
>>>> Thanks for running these.
>>>>
>>>> The biggest issue I know of for enabling very-cheap at -O2 is:
>>>>
>>>>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>>>>
>>>> Perhaps we could get around that by (hopefully temporarily) disabling
>>>> BB SLP within loop vectorisation for the very-cheap model.  This would
>>>> purely be a workaround and we should remove it once the PR is fixed.
>>>> (It would even be a compile-time win in the meantime :-))
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>> However there are usual arguments against:
>>>>>
>>>>>   1) Vectorizer being tuned for SPEC.  I think the only way to overcome
>>>>>      that argument is to enable it by default :)
>>>>>   2) Workloads improved are more of -Ofast type workloads
>>>>>
>>>>> Here are non-spec benchmarks we track:
>>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>>
>>>>> I also tried to run Firefox some time ago. Results are not surprising -
>>>>> vectorization helps rendering benchmarks, which are those compiled with
>>>>> aggressive flags anyway.
>>>>>
>>>>> Honza
>>>
>>> Hi:
>>>   I would like to ask if we can turn on vectorization at -O2 now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>>       /* If we applied if-conversion then try to vectorize the
>> 	 BB of innermost loops.
>> 	 ???  Ideally BB vectorization would learn to vectorize
>> 	 control flow by applying if-conversion on-the-fly, the
>> 	 following retains the if-converted loop body even when
>> 	 only non-if-converted parts took part in BB vectorization.  */
>>       if (flag_tree_slp_vectorize != 0
>> 	  && loop_vectorized_call
>> 	  && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
> 
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well).  At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
> 
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> 
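As a concrete sketch of the workaround discussed above (an untested illustration, not a committed patch), the quoted guard could additionally check the cost model kind, so the very-cheap model skips the BB SLP attempt on the if-converted body until PR100089 is fixed.  The identifiers `flag_vect_cost_model` and `VECT_COST_MODEL_VERY_CHEAP` are assumed from GCC's vectorizer internals and may not match the current tree exactly:

```c
/* Sketch only: gate the if-converted-body BB SLP attempt on the cost
   model, so -fvect-cost-model=very-cheap skips it until PR100089 is
   fixed properly.  Identifier names are assumptions based on GCC's
   internals; this is not a tested patch.  */
if (flag_tree_slp_vectorize != 0
    && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
    && loop_vectorized_call
    && ! loop->inner)
```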


Here are the measured performance speedups for -O2 vectorization with the
very-cheap cost model on both Power8 and Power9.

INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
FP: INT + -ffast-math

Column titles are:

<bmks>  <both loop and slp>  <loop only>  <slp only> (+:improvement, -:degradation)

Power8:
500.perlbench_r 	0.00% 0.00% 0.00%
502.gcc_r 		0.39% 0.78% 0.00%
505.mcf_r 		0.00% 0.00% 0.00%
520.omnetpp_r 		1.21% 0.30% 0.00%
523.xalancbmk_r 	0.00% 0.00% -0.57%
525.x264_r  		41.84%  42.55%  0.00%
531.deepsjeng_r 	0.00% -0.63%  0.00%
541.leela_r 		-3.44%  -2.75%  0.00%
548.exchange2_r 	1.66% 1.66% 0.00%
557.xz_r  		1.39% 1.04% 0.00%
Geomean 		3.67% 3.64% -0.06%

503.bwaves_r  		0.00% 0.00% 0.00%
507.cactuBSSN_r 	0.00% 0.29% 0.44%
508.namd_r  		0.00% 0.29% 0.00%
510.parest_r  		0.00% -0.36%  -0.54%
511.povray_r  		0.63% 0.31% 0.94%
519.lbm_r 		2.71% 2.71% 0.00%
521.wrf_r 		1.04% 1.04% 0.00%
526.blender_r 		-1.31%  -0.78%  0.00%
527.cam4_r  		-0.62%  -0.31%  -0.62%
538.imagick_r 		0.21% 0.21% -0.21%
544.nab_r 		0.00% 0.00% 0.00%
549.fotonik3d_r 	0.00% 0.00% 0.00%
554.roms_r  		0.30% 0.00% 0.00%
Geomean 		0.22% 0.26% 0.00%

Power9:

500.perlbench_r 	0.62% 0.62% -1.54%
502.gcc_r 		-0.60%  -0.60%  -0.81%
505.mcf_r 		2.05% 2.05% 0.00%
520.omnetpp_r 		-2.41%  -0.30%  -0.60%
523.xalancbmk_r 	-1.44%  -2.30%  -1.44%
525.x264_r  		24.26%  23.93%  -0.33%
531.deepsjeng_r 	0.32% 0.32% 0.00%
541.leela_r 		0.39% 1.18% -0.39%
548.exchange2_r 	0.76% 0.76% 0.00%
557.xz_r  		0.36% 0.36% -0.36%
Geomean 		2.19% 2.38% -0.55%

503.bwaves_r  		0.00% 0.36% 0.00%
507.cactuBSSN_r 	0.00% 0.00% 0.00%
508.namd_r  		-3.73%  -0.31%  -3.73%
510.parest_r  		-0.21%  -0.42%  -0.42%
511.povray_r  		-0.96%  -1.59%  0.64%
519.lbm_r 		2.31% 2.31% 0.17%
521.wrf_r 		2.66% 2.66% 0.00%
526.blender_r 		-1.96%  -1.68%  1.40%
527.cam4_r  		0.00% 0.91% 1.81%
538.imagick_r 		0.39% -0.19%  -10.29%  // known noise, imagick_r can have big jitter on P9 box sometimes.
544.nab_r 		0.25% 0.00% 0.00%
549.fotonik3d_r 	0.94% 0.94% 0.00%
554.roms_r  		0.00% 0.00% -1.05%
Geomean 		-0.03%  0.22% -0.93%


As above, the gains are mainly from loop vectorization.
Btw, the Power8 data may be more representative, since some benchmarks can have jitter on our P9 perf box.

BR,
Kewen

