This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [AArch64] Missed vectorization opportunity in cactusADM


On Wed, Apr 8, 2015 at 5:14 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Thu, Apr 02, 2015 at 04:20:06AM +0100, Ekanathan, Saravanan wrote:
>> (I had sent this mail to gcc-help a week ago. Not sure, all GCC developers
>> are subscribed to gcc-help, so re-sending to GCC development mailing list)
>>
>> Hi,
>>
>> This looks like a missed vectorization opportunity for one of the 'Fortran'
>> hot loops in cactusADM (CPU2006 benchmark) when compiled with
>> "-mcpu=cortex-a57 -Ofast".  Interestingly, the 'generic' model (compiled with
>> plain "-Ofast or -O3" and without -mcpu option) vectorizes this hot loop,
>> hence there is good runtime performance improvement noticed on native Aarch64
>> platform.
>>
>> I don't have a small reproducible testcase, hence quoting cactusADM benchmark
>> here.  The hot loop is present in Bench_StaggeredLeapfrog2() in
>> StaggeredLeapfrog2.F file.
>>
>> For cortex-a57, vectorization report clearly mentions that scalar cost <
>> vector_cost/vectorization_factor, hence didn't vectorize.
>>
>> For generic case, due to un-tuned vector cost model, the scalar cost >
>> vector_cost/vectorization_factor  (since scalar_cost = vector_cost), so the
>> loop got vectorized
>>
>>    << Output of  generic vectorized case>>
>>      StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: note: LOOP VECTORIZED
>>
>> I have also played around with cortexa57_vector_cost table(esp.,
>> scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost  etc..,), which
>> influences the vectorization decision in this case.  The
>> cortexa57_vector_cost table directly maps to the cost mentioned in
>> "Cortex-A57 Software Optimisation Guide".  But, it looks like there is
>> further scope of tuning the cortexa57 vector cost to vectorize such cases.
>>
>> Any comments on this missed opportunity ?
>
> When I added the vector costs for Cortex-A57, I followed the Cortex-A57
> Software Optimisation Guide [1] you mentioned above. I took a lower-bound
> estimate for each cost, which will certainly underestimate the
> floating-point scalar costs.
>
> So, I can believe that the costs will not be optimal for all test code
> you can give them, and I'm happy to look at patches which improve the
> vector costs. If you are planning to look at this, please feel free to
> raise a bugzilla issue and assign it to yourself so we can track things.
>
> Please be sure to test any changes across a range of workloads - from
> time to time I've seen issues with the Cortex-A57 vector costs where
> we have been too eager to vectorize and would have been better keeping
> to scalar code.

Speaking of AARCH64 vector costs there is the completely bogus

/* Implement targetm.vectorize.add_stmt_cost.  */
static unsigned
aarch64_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
                       struct _stmt_vec_info *stmt_info, int misalign,
                       enum vect_cost_model_location where)
{
...
      /* Statements in an inner loop relative to the loop being
         vectorized are weighted more heavily.  The value here is
         a function (linear for now) of the loop nest level.  */
      if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
        {
          loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info);
          struct loop *loop =  LOOP_VINFO_LOOP (loop_info);
          unsigned nest_level = loop_depth (loop);

          count *= nest_level;
        }

where somebody appearantly tried to be clever, not noticing that the
code present
in every other target is that odd because it matches the vectorizer
code.  You should
really replace this with the code from arm.c for example, that is,

Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c        (revision 221918)
+++ gcc/config/aarch64/aarch64.c        (working copy)
@@ -6588,13 +6588,7 @@ aarch64_add_stmt_cost (void *data, int c
         vectorized are weighted more heavily.  The value here is
         a function (linear for now) of the loop nest level.  */
       if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
-       {
-         loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_info);
-         struct loop *loop =  LOOP_VINFO_LOOP (loop_info);
-         unsigned nest_level = loop_depth (loop);
-
-         count *= nest_level;
-       }
+       count *= 50;  /* FIXME.  */

       retval = (unsigned) (count * stmt_cost);
       cost[where] += retval;

see tree-vect-loop.c:vect_get_single_scalar_iteration_cost:

...
  /* FORNOW.  */
  innerloop_iters = 1;
  if (loop->inner)
    innerloop_iters = 50; /* FIXME */
...

yes, that '50' should be a parameter somewhere in loop_vec_info.

Richard.

> Thanks,
> James
>
> ---
> [1]: Cortex-57 Software Optimisation Guide
>   http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015a/index.html
>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]