This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
Hi,
this is a proof-of-concept patch that makes the vectorizer costs use the costs already
used for rtx_cost and register_move_cost, which are readily available in ix86_costs,
instead of the vectorizer's own set of random values. At least until we have evidence
that the vectorizer costs need to differ, I do not think we want to complicate CPU
tuning by having them twice.
This is of course quite an intrusive change to what we have, because it affects all
x86 targets. I have finally worked out that the "random" values used by the AMD targets
correspond to the latencies of bdver1.
I have benchmarked the patch on Zen and also on the temporarily patched Czerny (Haswell).
It seems to cause no regressions and quite nice improvements:
- 27.3% for facerec on Zen
- 7% for mgrid on Haswell
- maybe 1% for galgel on Haswell
- 3% for facerec on Haswell
- maybe 1% for apsi on Haswell
- there may be small off-noise improvement for rnflow and regression for fatigue2 on Haswell
So I would say that the outcome is surprisingly good (especially given the lack of
noteworthy regressions). I also know that the vectorizer hurts performance on Zen for
the Mesa/tonto benchmarks, which is not cured by this patch alone.
There is testsuite fallout, though.
./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++11 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++14 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++98 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
Here we now vectorize the loop first, whereas originally we unrolled it and SLP vectorized afterwards.
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_1.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_2.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_3.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_4.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_5.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_6.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_1.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_2.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_3.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_4.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_5.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_6.c scan-assembler-times vfnmsub[123]+ss 120 (found 64 times)
And friends. Clearly we do not vectorize all the loops; I have not looked into the details yet.
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr61403.c scan-assembler blend
Here again we vectorize the loop, while originally we did SLP. I am not sure why the loop
vectorizer does not use blend.
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79683.c scan-assembler-times padd 1 (found 0 times)
Here we are supposed to vectorize two integer additions, but since the generic cost model now
claims that the latency of a vector add is twice that of an integer add, we don't. I think it makes sense.
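The profitability argument above can be sketched as a toy comparison. This is only an illustration, not the vectorizer's actual code: the struct toy_costs, the function toy_vectorize_profitable_p, the strict "cheaper than" test, and the numeric values are all hypothetical, chosen so that a vector add costs twice as much as a scalar integer add. Loads and stores are ignored. Only COSTS_N_INSNS follows its real definition in GCC's rtl.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Same scaling as GCC's rtl.h: one simple instruction costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical, simplified cost table; not the real struct processor_costs.  */
struct toy_costs
{
  int scalar_int_op;	/* Cost of one scalar integer add.  */
  int sse_op;		/* Cost of one vector integer add.  */
};

/* Assumed values: a vector add has twice the latency of an integer add.  */
const struct toy_costs toy_generic_cost = {
  COSTS_N_INSNS (1),
  COSTS_N_INSNS (2)
};

/* Return true if replacing N_SCALAR scalar adds by N_VECTOR vector adds
   is strictly cheaper under COSTS.  */
bool
toy_vectorize_profitable_p (const struct toy_costs *costs,
			    int n_scalar, int n_vector)
{
  assert (n_scalar > 0 && n_vector > 0);
  int scalar_cost = n_scalar * costs->scalar_int_op;
  int vector_cost = n_vector * costs->sse_op;
  return vector_cost < scalar_cost;
}
```

With these assumed numbers, toy_vectorize_profitable_p (&toy_generic_cost, 2, 1) compares 8 against 8 and rejects vectorization, which matches the behaviour seen in pr79683.c. Four scalar adds replaced by one vector add would still be accepted.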
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79723.c scan-assembler mov[au]p.[ \t][^,]+, %gs:
Similarly here.
If this seems to make sense, I will clean it up (remove the now unused entries and scale
the conditional costs by COSTS_N_INSNS) and fix the testsuite fallout.
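For readers unfamiliar with the scaling, here is a minimal sketch of the arithmetic the patch applies to load and store table entries. COSTS_N_INSNS is defined as in GCC's rtl.h; the helper toy_scaled_mem_cost is hypothetical. The division by two presumably reflects that these table entries are expressed relative to a reg-reg move cost of 2, but that reading is an assumption.

```c
#include <assert.h>

/* Same scaling as GCC's rtl.h: one simple instruction costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper mirroring the patch's load/store formula:
   scale a move-cost table entry to COSTS_N_INSNS units and halve it.  */
int
toy_scaled_mem_cost (int table_entry)
{
  assert (table_entry >= 0);
  return COSTS_N_INSNS (table_entry) / 2;
}
```

Under this formula a table entry of 2 maps to COSTS_N_INSNS (1), i.e. the cost of one simple instruction, and an entry of 6 maps to 12 units.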
Honza
Index: i386.c
===================================================================
--- i386.c (revision 253824)
+++ i386.c (working copy)
@@ -44015,50 +44015,56 @@ static int
ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
tree vectype, int)
{
+ bool fp = false;
+ if (vectype != NULL)
+ fp = FLOAT_TYPE_P (vectype);
+
switch (type_of_cost)
{
case scalar_stmt:
- return ix86_cost->scalar_stmt_cost;
+ return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
case scalar_load:
- return ix86_cost->scalar_load_cost;
+ return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
+ : ix86_cost->int_load [2]) / 2;
case scalar_store:
- return ix86_cost->scalar_store_cost;
+ return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
+ : ix86_cost->int_store [2]) / 2;
case vector_stmt:
- return ix86_cost->vec_stmt_cost;
+ return fp ? ix86_cost->addss : ix86_cost->sse_op;
case vector_load:
- return ix86_cost->vec_align_load_cost;
+ return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
case vector_store:
- return ix86_cost->vec_store_cost;
+ return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
case vec_to_scalar:
- return ix86_cost->vec_to_scalar_cost;
-
case scalar_to_vec:
- return ix86_cost->scalar_to_vec_cost;
+ return ix86_cost->sse_op;
case unaligned_load:
- case unaligned_store:
case vector_gather_load:
+ return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
+
+ case unaligned_store:
case vector_scatter_store:
- return ix86_cost->vec_unalign_load_cost;
+ return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
case cond_branch_taken:
- return ix86_cost->cond_taken_branch_cost;
+ return COSTS_N_INSNS (ix86_cost->cond_taken_branch_cost);
case cond_branch_not_taken:
- return ix86_cost->cond_not_taken_branch_cost;
+ return COSTS_N_INSNS (ix86_cost->cond_not_taken_branch_cost);
case vec_perm:
case vec_promote_demote:
- return ix86_cost->vec_stmt_cost;
+ return ix86_cost->sse_op;
case vec_construct:
- return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
+ return ix86_cost->sse_op * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
default:
gcc_unreachable ();