This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [AArch64] Missed vectorization opportunity in cactusADM

On Thu, Apr 02, 2015 at 04:20:06AM +0100, Ekanathan, Saravanan wrote:
> (I had sent this mail to gcc-help a week ago. Not sure, all GCC developers
> are subscribed to gcc-help, so re-sending to GCC development mailing list)
> Hi,
> This looks like a missed vectorization opportunity for one of the 'Fortran'
> hot loops in cactusADM (CPU2006 benchmark) when compiled with
> "-mcpu=cortex-a57 -Ofast".  Interestingly, the 'generic' model (compiled with
> plain "-Ofast or -O3" and without -mcpu option) vectorizes this hot loop,
> hence there is good runtime performance improvement noticed on native Aarch64
> platform.
> I don't have a small reproducible testcase, hence quoting cactusADM benchmark
> here.  The hot loop is present in Bench_StaggeredLeapfrog2() in
> StaggeredLeapfrog2.F file.
> For cortex-a57, vectorization report clearly mentions that scalar cost <
> vector_cost/vectorization_factor, hence didn't vectorize.
> For generic case, due to un-tuned vector cost model, the scalar cost >
> vector_cost/vectorization_factor  (since scalar_cost = vector_cost), so the
> loop got vectorized
>    << Output of  generic vectorized case>>
>      StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: note: LOOP VECTORIZED
> I have also played around with cortexa57_vector_cost table(esp.,
> scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost  etc..,), which
> influences the vectorization decision in this case.  The
> cortexa57_vector_cost table directly maps to the cost mentioned in
> "Cortex-A57 Software Optimisation Guide".  But, it looks like there is
> further scope of tuning the cortexa57 vector cost to vectorize such cases.
> Any comments on this missed opportunity ?
When I added the vector costs for Cortex-A57, I followed the Cortex-A57
Software Optimisation Guide [1] you mentioned above. I took a lower-bound
estimate for each cost, which will certainly underestimate the
floating-point scalar costs.

So, I can believe that the costs will not be optimal for all test code
you can give them, and I'm happy to look at patches which improve the
vector costs. If you are planning to look at this, please feel free to
raise a bugzilla issue and assign it to yourself so we can track things.

Please be sure to test any changes across a range of workloads - from
time to time I've seen issues with the Cortex-A57 vector costs where
we have been too eager to vectorize and would have been better keeping
to scalar code.


[1]: Cortex-57 Software Optimisation Guide

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]