This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Hongtao Liu <crazylht at gmail dot com>
- Cc: "H. J. Lu" <hjl dot tools at gmail dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Uros Bizjak <ubizjak at gmail dot com>
- Date: Tue, 12 Nov 2019 09:19:20 +0100
- Subject: Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.
- References: <CAMZc-byz4N3PUqAk0RqZU+=DEJhYw_curYd1JDn_dNjun5xskw@mail.gmail.com>
On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
> This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
> targets, because we found there is still a performance gap between
> 128-bit auto-vectorization and 256-bit auto-vectorization even with
> the epilogue vectorized.
> The performance impact of setting avx128_optimal as the default on
> SPEC2017 with options `-march=native -funroll-loops -Ofast -flto` on
> CLX is as below:
>
> INT rate
> 500.perlbench_r -0.32%
> 502.gcc_r -1.32%
> 505.mcf_r -0.12%
> 520.omnetpp_r -0.34%
> 523.xalancbmk_r -0.65%
> 525.x264_r 2.23%
> 531.deepsjeng_r 0.81%
> 541.leela_r -0.02%
> 548.exchange2_r 10.89% ----------> big improvement
> 557.xz_r 0.38%
> geomean for intrate 1.10%
>
> FP rate
> 503.bwaves_r 1.41%
> 507.cactuBSSN_r -0.14%
> 508.namd_r 1.54%
> 510.parest_r -0.87%
> 511.povray_r 0.28%
> 519.lbm_r 0.32%
> 521.wrf_r -0.54%
> 526.blender_r 0.59%
> 527.cam4_r -2.70%
> 538.imagick_r 3.92%
> 544.nab_r 0.59%
> 549.fotonik3d_r -5.44% -------------> regression
> 554.roms_r -2.34%
> geomean for fprate -0.28%
>
> The 10% improvement of 548.exchange2_r is because there is a 9-layer
> nested loop, and the loop count of the innermost layer is small (enough
> for 128-bit vectorization, but not for 256-bit vectorization).
> Since the loop count cannot be determined statically, the vectorizer
> chooses 256-bit vectorization, whose vector loop is then never
> executed. Vectorizing the epilogue introduces some extra instructions;
> normally that wins back some performance, but since this is a 9-layer
> nested loop, the cost of the extra instructions outweighs the gain.
>
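To make the trip-count argument above concrete, here is a minimal sketch in C. It only models how a scalar trip count splits between the main vector loop and the epilogue; the function names are hypothetical, and the element width (32 bits) and the trip count used in the note below are assumptions for illustration, not values taken from 548.exchange2_r.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations executed by the main vector loop for a scalar trip count
 * `n` and a vectorization factor `vf` (elements per vector). */
size_t main_loop_iters(size_t n, size_t vf)
{
    assert(vf > 0);
    return n / vf;
}

/* Scalar iterations left over for the epilogue (which may itself be
 * vectorized at a narrower width). */
size_t epilogue_iters(size_t n, size_t vf)
{
    assert(vf > 0);
    return n % vf;
}
```

With 32-bit elements, a 256-bit vector holds 8 elements and a 128-bit vector holds 4. For an assumed trip count of 6, main_loop_iters(6, 8) is 0, so the 256-bit loop body never runs and all the work falls to the epilogue, while main_loop_iters(6, 4) is 1 with 2 iterations left over.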
> The 5.44% regression of 549.fotonik3d_r is because 256-bit
> vectorization is better than 128-bit vectorization. Generally, when
> enabling 256-bit or 512-bit vectorization, instruction clockticks are
> reduced, but so is frequency. When the frequency reduction is smaller
> than the instruction clockticks reduction, the longer vector width is
> better than the shorter one; otherwise the opposite holds. The
> regression of 549.fotonik3d_r is due to this, and similarly for
> 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> vectorization is best.
>
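The clockticks-versus-frequency trade-off described above can be written down as a small model: run time is clockticks divided by frequency, so the wider width wins only when its clocktick saving outweighs its frequency drop. The sketch below is illustrative only; the function names are hypothetical and no measured numbers from the benchmarks are implied.

```c
#include <assert.h>

/* Run time of a kernel: clockticks divided by core frequency in Hz. */
double run_time_seconds(double clockticks, double freq_hz)
{
    assert(freq_hz > 0.0);
    return clockticks / freq_hz;
}

/* Returns 1 when the wider vector width finishes sooner than the
 * narrower one, 0 otherwise. */
int wider_is_faster(double ticks_narrow, double freq_narrow,
                    double ticks_wide, double freq_wide)
{
    return run_time_seconds(ticks_wide, freq_wide)
         < run_time_seconds(ticks_narrow, freq_narrow);
}
```

For example, under an assumed 10% frequency drop (3.0 GHz to 2.7 GHz), a 30% clocktick reduction makes the wider width faster, whereas a 5% clocktick reduction does not.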
> Bootstrap and regression tests on i386 are OK.
> Ok for trunk?
I don't think 128_optimal does what you think it does. If you want to
prefer 128-bit AVX, adjust the preference; 128_optimal describes
a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
and is _not_ intended for "tuning".
Richard.
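
As an illustration of adjusting the preference rather than the tuning flag, here is a minimal sketch. It assumes a GCC release that supports the x86 option -mprefer-vector-width; the kernel and the file name saxpy.c are hypothetical and not part of the patch.

```c
#include <stddef.h>

/* A simple auto-vectorizable kernel: y = a * x + y.
 *
 * Assumed compile commands for comparing vector-width preferences
 * (x86 only; the file name is a placeholder):
 *   gcc -O3 -march=native -mprefer-vector-width=128 -c saxpy.c
 *   gcc -O3 -march=native -mprefer-vector-width=256 -c saxpy.c
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The option states a preference for the vectorizer without claiming anything about how the hardware executes 256-bit operations, which is the distinction drawn above.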
> Changelog
> gcc/
> * config/i386/i386-option.c (m_CORE_AVX): New macro.
> * config/i386/x86-tune.def: Enable 128_optimal for avx and
> replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> * testsuite/gcc.target/i386/pr70021.c: Ditto.
> * testsuite/gcc.target/i386/pr90579.c: New test.
>
>
> --
> BR,
> Hongtao