This is the mail archive of the
mailing list for the GCC project.
[AArch64] Missed vectorization opportunity in cactusADM
- From: "Ekanathan, Saravanan" <Saravanan dot Ekanathan at amd dot com>
- To: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Cc: Marcus Shawcroft <marcus dot shawcroft at arm dot com>, James Greenhalgh <james dot greenhalgh at arm dot com>
- Date: Thu, 2 Apr 2015 03:20:06 +0000
- Subject: [AArch64] Missed vectorization opportunity in cactusADM
(I sent this mail to gcc-help a week ago. I am not sure that all GCC developers are subscribed to gcc-help, so I am re-sending it to the GCC development mailing list.)
This looks like a missed vectorization opportunity for one of the Fortran hot loops in cactusADM (a CPU2006 benchmark) when compiled with "-mcpu=cortex-a57 -Ofast".
Interestingly, the 'generic' model (compiled with plain "-Ofast" or "-O3" and without the -mcpu option) vectorizes this hot loop, which gives a good runtime improvement on a native AArch64 platform.
I don't have a small reproducible testcase, so I am citing the cactusADM benchmark here.
The hot loop is in Bench_StaggeredLeapfrog2() in the file StaggeredLeapfrog2.F.
For cortex-a57, the vectorization report clearly states that scalar cost < vector_cost/vectorization_factor, so the loop was not vectorized.
For the generic case, because the vector cost model is untuned, scalar cost > vector_cost/vectorization_factor (since scalar_cost = vector_cost), so the loop was vectorized.
<< Output of generic vectorized case >>
StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: note: LOOP VECTORIZED
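To make the comparison above concrete, here is a minimal C sketch of the decision rule as described in this mail: vectorize only when the scalar cost exceeds vector_cost/vectorization_factor. It is not GCC's implementation. The struct, the function name, the statement count and all cost values are hypothetical, and the real cost model also accounts for things this sketch ignores (loads, stores, prologue and epilogue costs).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-statement cost table.  The field names
   follow the wording of this mail; the values are illustrative and are
   NOT taken from GCC's generic or cortexa57 cost tables.  */
struct toy_vector_cost {
    int scalar_stmt_cost;   /* cost of one scalar statement */
    int vector_stmt_cost;   /* cost of one vector statement */
};

/* Untuned table: scalar and vector statements cost the same.  */
const struct toy_vector_cost toy_generic_like = { 1, 1 };

/* Tuned table (made-up numbers): vector statements are 3x as expensive.  */
const struct toy_vector_cost toy_tuned_like = { 1, 3 };

/* Return true when the loop body is considered profitable to vectorize,
   i.e. when scalar_cost > vector_cost / vectorization_factor.
   The inequality is cross-multiplied to avoid integer division.  */
bool toy_should_vectorize(const struct toy_vector_cost *cost,
                          int n_stmts, int vectorization_factor)
{
    assert(n_stmts > 0 && vectorization_factor > 0);

    int scalar_cost = n_stmts * cost->scalar_stmt_cost;
    int vector_cost = n_stmts * cost->vector_stmt_cost;

    return scalar_cost * vectorization_factor > vector_cost;
}
```

With a 10-statement body and a vectorization factor of 2, toy_should_vectorize(&toy_generic_like, 10, 2) is true (10 > 10/2), while toy_should_vectorize(&toy_tuned_like, 10, 2) is false (10 < 30/2). This mirrors how equal scalar and vector statement costs let the generic model vectorize a loop that a tuned table rejects.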
I have also experimented with the cortexa57_vector_cost table (especially scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost, etc.), which influences the vectorization decision in this case.
The cortexa57_vector_cost table maps directly to the costs given in the "Cortex®-A57 Software Optimisation Guide".
However, it looks like there is further scope for tuning the cortexa57 vector costs so that such cases are vectorized.
Any comments on this missed opportunity?
PS. I am not pasting the hot loop here, as there could be a license issue with posting SPEC CPU2006 sources.