This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: [AArch64] Missed vectorization opportunity in cactusADM
- From: James Greenhalgh <james dot greenhalgh at arm dot com>
- To: "Ekanathan, Saravanan" <Saravanan dot Ekanathan at amd dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>
- Date: Wed, 8 Apr 2015 16:14:21 +0100
- Subject: Re: [AArch64] Missed vectorization opportunity in cactusADM
- References: <8CCF65FBE54B884C85A7D47EBF0834C70927E795 at storexdag02 dot amd dot com>
On Thu, Apr 02, 2015 at 04:20:06AM +0100, Ekanathan, Saravanan wrote:
> (I sent this mail to gcc-help a week ago. I am not sure that all GCC
> developers are subscribed to gcc-help, so I am re-sending it to the GCC
> development mailing list.)
>
> Hi,
>
> This looks like a missed vectorization opportunity for one of the Fortran
> hot loops in cactusADM (CPU2006 benchmark) when compiled with
> "-mcpu=cortex-a57 -Ofast". Interestingly, the 'generic' model (compiled with
> plain "-Ofast" or "-O3" and without the -mcpu option) vectorizes this hot
> loop, and I see a good runtime performance improvement on a native AArch64
> platform.
>
> I don't have a small reproducible testcase, so I am quoting the cactusADM
> benchmark here. The hot loop is in Bench_StaggeredLeapfrog2() in the
> StaggeredLeapfrog2.F file.
>
> For cortex-a57, the vectorization report clearly states that scalar_cost <
> vector_cost/vectorization_factor, hence the loop was not vectorized.
>
> For the generic case, due to the un-tuned vector cost model, scalar_cost >
> vector_cost/vectorization_factor (since scalar_cost = vector_cost), so the
> loop was vectorized.
>
> << Output of the generic vectorized case >>
> StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: note: LOOP VECTORIZED
>
> I have also played around with the cortexa57_vector_cost table (especially
> scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost, etc.), which
> influences the vectorization decision in this case. The
> cortexa57_vector_cost table maps directly to the costs given in the
> "Cortex-A57 Software Optimisation Guide". However, it looks like there is
> further scope for tuning the cortexa57 vector costs to vectorize such cases.
>
> Any comments on this missed opportunity?

When I added the vector costs for Cortex-A57, I followed the Cortex-A57
Software Optimisation Guide [1] you mentioned above. I took a lower-bound
estimate for each cost, which will certainly underestimate the
floating-point scalar costs.
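
To make the comparison in the quoted report concrete, here is a rough
sketch of the per-iteration profitability test. This is a simplified
illustration under stated assumptions, not GCC's actual vectorizer code:
the struct, the function name toy_profitable and the cost numbers in the
note below are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified cost table. The field names echo the
   ones discussed in this thread, but this is NOT GCC's real structure. */
struct toy_vector_cost
{
  int scalar_stmt_cost;  /* assumed cost of one scalar statement */
  int vec_stmt_cost;     /* assumed cost of one vector statement */
};

/* Return true when the vector loop looks cheaper per scalar iteration,
   i.e. when scalar_cost > vector_cost / vf.  The test is cross-multiplied
   so that integer division does not truncate the result.  */
static bool
toy_profitable (const struct toy_vector_cost *c, int stmts, int vf)
{
  int scalar_cost = stmts * c->scalar_stmt_cost;
  int vector_cost = stmts * c->vec_stmt_cost;
  return scalar_cost * vf > vector_cost;
}
```

With equal scalar and vector statement costs (say 1 and 1), ten statements
and a vectorization factor of 2, this check favours vectorizing, which is
the situation described for the generic model. Raising the vector statement
cost to 3 flips the decision, which is the kind of outcome reported for
-mcpu=cortex-a57.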

So, I can believe that the costs will not be optimal for all test code
you can give them, and I'm happy to look at patches which improve the
vector costs. If you are planning to look at this, please feel free to
raise a bugzilla issue and assign it to yourself so we can track things.

Please be sure to test any changes across a range of workloads - from
time to time I've seen issues with the Cortex-A57 vector costs where
we have been too eager to vectorize and would have been better keeping
to scalar code.

Thanks,
James
---
[1]: Cortex-A57 Software Optimisation Guide
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015a/index.html