This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[PATCH][AArch64] Tweak Cortex-A57 vector cost
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: GCC Patches <gcc-patches at gcc dot gnu dot org>
- Cc: nd <nd at arm dot com>
- Date: Thu, 10 Nov 2016 17:10:00 +0000
- Subject: [PATCH][AArch64] Tweak Cortex-A57 vector cost
The existing vector costs prevent some beneficial vectorization. This is mostly due
to the vector statement cost being set to 3, as well as vector loads having a higher
cost than scalar loads. As a result, even with a vectorization factor of 4, the cost
of a vectorized loop can be similar to that of the scalar version, and we fail
to vectorize. For example, for a particular loop the costs for -mcpu=generic are:
note: Cost model analysis:
Vector inside of loop cost: 146
Vector prologue cost: 5
Vector epilogue cost: 0
Scalar iteration cost: 50
Scalar outside cost: 0
Vector outside cost: 5
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
note: Runtime profitability threshold = 3
note: Static estimate profitability threshold = 3
note: loop vectorized
While -mcpu=cortex-a57 reports:
note: Cost model analysis:
Vector inside of loop cost: 294
Vector prologue cost: 15
Vector epilogue cost: 0
Scalar iteration cost: 74
Scalar outside cost: 0
Vector outside cost: 15
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 31
note: Runtime profitability threshold = 30
note: Static estimate profitability threshold = 30
note: not vectorized: vectorization not profitable.
note: not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).
Using a cost of 3 for a vector operation suggests it is 3 times as
expensive as a scalar operation. Since most vector operations have
throughput similar to scalar operations, this is not correct.
Using slightly lower values for these heuristics now allows this loop,
and many others, to be vectorized. On a proprietary benchmark the gain
from vectorizing this loop is around 15-30%, which shows vectorizing it is
indeed beneficial.
ChangeLog:
2016-11-10 Wilco Dijkstra <wdijkstr@arm.com>
* config/aarch64/aarch64.c (cortexa57_vector_cost):
Change vec_stmt_cost, vec_align_load_cost and vec_unalign_load_cost.
--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 279a6dfaa4a9c306bc7a8dba9f4f53704f61fefe..cff2e8fc6e9309e6aa4f68a5aba3bfac3b737283 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -382,12 +382,12 @@ static const struct cpu_vector_cost cortexa57_vector_cost =
1, /* scalar_stmt_cost */
4, /* scalar_load_cost */
1, /* scalar_store_cost */
- 3, /* vec_stmt_cost */
+ 2, /* vec_stmt_cost */
3, /* vec_permute_cost */
8, /* vec_to_scalar_cost */
8, /* scalar_to_vec_cost */
- 5, /* vec_align_load_cost */
- 5, /* vec_unalign_load_cost */
+ 4, /* vec_align_load_cost */
+ 4, /* vec_unalign_load_cost */
1, /* vec_unalign_store_cost */
1, /* vec_store_cost */
1, /* cond_taken_branch_cost */