This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


target cost model tuning for x86

From: "Jagasia, Harsha" <harsha dot jagasia at amd dot com>
To: gcc-patches at gcc dot gnu dot org
Cc: "Dorit Nuzman" <DORIT at il dot ibm dot com>, "Jan Hubicka" <jh at suse dot cz>, "Sebastian Pop" <sebpop at gmail dot com>
Date: Thu, 6 Sep 2007 15:40:22 -0500
Subject: target cost model tuning for x86

Hello,

This patch defines the cost-model target-specific costs for x86 (only
AMD K8 for now). I looked at some Polyhedron benchmarks to arrive at the
costs. I also made some small fixes to the cost model itself.

- The processor costs for all processors except K8 are currently filled
in with the defaults used by the cost model; K8 is tuned. For now,
size_cost is set to just 1 byte for every statement. size_cost is not
used yet, but it could be used in the future: there was some discussion
at the summit about using code size as an alternative criterion for
deciding whether to vectorize.
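
As a rough illustration of what size_cost holds, here is a minimal C
sketch. It assumes a hypothetical, cut-down struct (vect_cost_fields)
containing only the nine vectorizer cost fields named in the ChangeLog
below; the real struct processor_costs in config/i386/i386.h has many
more members, and the per-field comments are inferred from the field
names rather than taken from the patch.

```c
/* Hypothetical, cut-down view of the vectorizer cost fields that the
   ChangeLog adds to struct processor_costs (not the real struct).
   Field meanings are inferred from their names.  */
struct vect_cost_fields {
  int scalar_stmt_cost;        /* scalar statement other than load/store */
  int scalar_load_cost;        /* scalar load */
  int scalar_store_cost;       /* scalar store */
  int vec_stmt_cost;           /* vector statement other than load/store */
  int vec_to_scalar_cost;      /* moving a value from vector to scalar */
  int scalar_to_vec_cost;      /* moving a value from scalar to vector */
  int vec_align_load_cost;     /* aligned vector load */
  int vect_unalign_load_cost;  /* unaligned vector load */
  int vec_store_cost;          /* vector store */
};

/* When tuning for size, every statement counts as 1 byte.
   (size_cost is not consulted by the cost model yet.)  */
static const struct vect_cost_fields size_vect_costs = {
  1, 1, 1, 1, 1, 1, 1, 1, 1
};
```

A tuned entry such as K8 would fill the same fields with
processor-specific values instead of all ones.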

- One issue I have run into is the testsuite. I have not added any test
cases because the tests under vect/costmodel/x86 are run without any
-mtune option and therefore default to -mtune=generic. As a result, the
k8_costs added by this patch are not exercised by make check right now.

One way to address this is to add subdirectories for the individual x86
processors that can be tuned for, and to replicate the test cases for
each of those processors, with possibly different expected behavior when
vectorized with the cost model.

Another way would be to use -mtune=native in the x86 costmodel-vect.exp
file, emit the processor actually being tuned for in the vectorizer
dump, and add a macro along the lines of "istarget" that can search for
the processor string. Any input on these approaches would be very
welcome.

The following changes have been made to the cost model itself:

- When the number of prologue/epilogue iterations is unknown, the cost
model currently assumes (VF-1)/2. For double precision on x86 (VF = 2),
this estimates both the prologue and the epilogue as 0 iterations when
alignment is unknown. Yet if the number of iterations is unknown or not
exactly divisible by the vectorization factor, the cost model still
accounts for guard branches in the epilogue, even though the epilogue is
estimated at 0 iterations. (A different form of this problem may also
affect the SPU, where the branch cost is 6 and the run-time test cost is
-19; this can give an outside vector cost of (2*6) + (-19) = -7, a
negative value.) This patch changes the estimated prologue/epilogue,
when unknown, to VF/2 so that double precision is no longer estimated as
0 (this still does not eliminate the SPU corner case mentioned above).
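
The integer arithmetic behind this change can be checked with a small C
sketch. The function names are hypothetical; they only reproduce the
two estimates described above and the SPU outside-cost arithmetic (two
guard branches plus one run-time test). With 128-bit SSE2 vectors, a
double-precision loop has VF = 2, so (VF-1)/2 truncates to 0 while VF/2
gives 1.

```c
/* Estimated scalar iterations peeled into the prologue (or epilogue)
   when the count is unknown at compile time.  vf is the vectorization
   factor; C integer division truncates toward zero.  */
static int peel_estimate_old (int vf) { return (vf - 1) / 2; }
static int peel_estimate_new (int vf) { return vf / 2; }

/* SPU corner case from the text: outside cost made of two guard
   branches plus one run-time test, which can go negative.  */
static int spu_outside_cost (int branch_cost, int runtime_test_cost)
{
  return 2 * branch_cost + runtime_test_cost;
}
```

For single precision (VF = 4) the old and new estimates are 1 and 2, so
the change also raises the estimate there, not only for double
precision.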

- The patch also changes the number of guard branches for the prologue
and for the epilogue from 2 each to 1 each. In the loops I looked at,
these branches can get optimized away, especially when the alignment and
the number of iterations are known at compile time.
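
A minimal sketch of how the guard-branch term of the outside cost
scales with this change, assuming both a prologue and an epilogue loop
are generated and each guard costs branch_cost. The helper name is
hypothetical and this is not the code in tree-vect-transform.c.

```c
/* Hypothetical helper: total cost of the guard branches around the
   peeled loops.  guards_per_loop was 2 before the patch and is 1
   after it.  */
static int peel_guard_cost (int guards_per_loop, int branch_cost)
{
  const int peeled_loops = 2;  /* one prologue loop, one epilogue loop */
  return peeled_loops * guards_per_loop * branch_cost;
}
```

With a branch cost of 3, for instance, the guard term drops from 12 to
6.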
 
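- The cost model equation is changed to

    min_profitable_iters = (voc * vf
                            - vic * prologue_iters
                            - vic * epilogue_iters)
                           / (sic * vf - vic)

  instead of

    min_iters = (voc * vf) / (sic * vf - vic)

  where voc is the vector outside-of-loop cost, vic the vector
  inside-of-loop cost, sic the scalar inside-of-loop cost and vf the
  vectorization factor.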
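
One way to see where the new equation comes from, assuming voc already
includes the cost of running the peeled prologue/epilogue iterations as
scalar code: for n iterations the scalar loop costs sic * n, while the
vectorized version costs voc + vic * (n - prologue_iters -
epilogue_iters) / vf. Requiring the vector cost to be lower and
multiplying through by vf gives n * (sic * vf - vic) > voc * vf - vic *
prologue_iters - vic * epilogue_iters, which rearranges to the threshold
above when sic * vf > vic. The C sketch below evaluates both formulas
with integer division; the function names are hypothetical.

```c
/* voc: vector cost outside the loop (assumed to include guards and
        the peeled scalar iterations)
   vic: vector cost of one vector-loop iteration
   sic: scalar cost of one scalar-loop iteration
   vf:  vectorization factor
   Both functions return -1 when sic * vf <= vic, i.e. when the vector
   body is never cheaper than vf scalar iterations.  */
static int min_iters_old (int voc, int vic, int sic, int vf)
{
  int denom = sic * vf - vic;
  if (denom <= 0)
    return -1;
  return (voc * vf) / denom;
}

static int min_profitable_iters_new (int voc, int vic, int sic, int vf,
                                     int prologue_iters, int epilogue_iters)
{
  int denom = sic * vf - vic;
  if (denom <= 0)
    return -1;
  /* Peeled iterations run as scalar code, so they are removed from the
     work attributed to the vector loop.  */
  return (voc * vf
          - vic * prologue_iters
          - vic * epilogue_iters) / denom;
}
```

With illustrative values voc = 30, vic = 4, sic = 3, vf = 4 and one
peeled iteration on each side, the old formula gives 120 / 8 = 15 and
the new one gives (120 - 4 - 4) / 8 = 14: iterations that run in the
peeled scalar loops no longer count as vector iterations.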

- Currently, the cost model checks whether the number of iterations is
LESS than the compile-time estimate, rather than LESS than or EQUAL to
it. This patch changes the comparison to less than or equal.
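
The difference only matters when the iteration count equals the
threshold. Because min_profitable_iters is computed with integer
division, it is (roughly) the truncated break-even point, so at niters
== min_profitable_iters the vector loop is not estimated to be cheaper
and should not be vectorized. A minimal sketch with hypothetical
function names:

```c
#include <stdbool.h>

/* th is the minimum profitable iteration count from the cost model.
   Each function returns true if the loop would be vectorized.  */
static bool vectorize_old (int niters, int th) { return !(niters < th); }
static bool vectorize_new (int niters, int th) { return !(niters <= th); }
```

Only at niters == th do the two checks disagree.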

- The cost model checks at compile time whether the loop iterations are
less than or equal to its estimate. It also checks at run time whether
the iterations left after peeling for the prologue and epilogue are less
than or equal to that estimate. As a result, a loop determined to be
profitable at compile time may be evaluated as unprofitable at run time.
For example, take num_iters=7, vf=4, a cost-model estimate of
min_profitable_iters=6, and aligned data. At compile time, since
num_iters > min_profitable_iters, the loop is profitable to vectorize.
However, the run-time check compares the iterations left after peeling
for vf, i.e., 4, against min_profitable_iters and finds the loop
unprofitable.
This patch sets the run-time threshold to 0 for cases whose
profitability can be evaluated at compile time.
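
The example above can be traced with a short C sketch. The helper names
are hypothetical; it assumes aligned data (so no prologue is peeled) and
an epilogue that peels num_iters % vf iterations, and it models the
patch's rule of using a run-time threshold of 0 when profitability is
already decided at compile time.

```c
#include <stdbool.h>

/* Iterations left for the vector loop: no prologue (data aligned),
   and the epilogue peels niters % vf scalar iterations.  */
static int iters_after_peeling (int niters, int vf)
{
  return niters - niters % vf;
}

/* Run-time check: the vector loop runs only if the remaining
   iterations exceed the threshold.  */
static bool runtime_check_passes (int remaining, int runtime_th)
{
  return remaining > runtime_th;
}

/* With the patch, a loop whose profitability is decided at compile
   time (known iteration count, known alignment) gets threshold 0.  */
static int runtime_threshold (bool decided_at_compile_time,
                              int min_profitable_iters)
{
  return decided_at_compile_time ? 0 : min_profitable_iters;
}
```

For num_iters = 7 and vf = 4, 4 iterations remain; against a threshold
of 6 the old run-time check fails, while with threshold 0 it passes,
matching the compile-time decision.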

- The patch has been bootstrapped on x86-64, and all the 64-bit and
32-bit tests pass as expected except costmodel-vect-31.c on x86_64 and
i386.

The loop at line 66 in vect-31.c was previously estimated as not
profitable to vectorize. Because of the changes to the cost model, it is
now estimated as profitable. Running it on K8 indicates that vectorizing
it is indeed profitable, so the cost model now estimates this loop
correctly for K8.

ChangeLogs:
gcc/ChangeLog:
2007-09-04  Harsha Jagasia <harsha.jagasia@amd.com>

	* config/i386/i386.h (processor_costs): Add scalar_stmt_cost,
	  scalar_load_cost, scalar_store_cost, vec_stmt_cost,
	  vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
	  vect_unalign_load_cost, vec_store_cost.
	  Define macros for x86 costs.
	* config/i386/i386.c (size_cost): Set scalar_stmt_cost,
	  scalar_load_cost, scalar_store_cost, vec_stmt_cost,
	  vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
	  vect_unalign_load_cost, vec_store_cost to 1.
	  (i386_cost, i486_cost, pentium_cost, pentiumpro_cost, geode_cost,
	  k6_cost, athlon_cost, amdfam10_cost, pentium4_cost, nocona_cost,
	  core2_cost, generic64_cost, generic32_cost): Set to default untuned
	  costs.
	  (k8_cost): Costs for vectorization tuned.
	  (x86_builtin_vectorization_cost): New.
	* tree-vect-analyze.c (vect_analyze_operations): Change comparison of
	  loop iterations with threshold to less than or equal to instead of
	  less than.
	* tree-vect-transform.c (vect_estimate_min_profitable_iters):
	  Change prologue and epilogue iterations estimate to vf/2, when
	  unknown at compile-time. Change the number of guards for prologue
	  and epilogue to be 1 each. Change the cost model equation to
	  consider vector iterations as the loop iterations less the
	  prologue and epilogue iterations.
	  (vect_do_peeling_for_loop_bound): Set profitability threshold to
	  zero when loop has known number of iterations and known alignment
	  for data.

gcc/testsuite/ChangeLog:
2007-09-04  Harsha Jagasia <harsha.jagasia@amd.com>

	* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Change dg-final to
	  expect 1 non-profitable loop and 3 profitable loops.
	* gcc.dg/vect/costmodel/x86-64/costmodel-vect-31.c: Change dg-final to
	  expect 1 non-profitable loop and 3 profitable loops.

Thanks,
Harsha

Attachment: target.patch.txt
Description: target.patch.txt

Follow-Ups:
- Re: target cost model tuning for x86
  - From: Tobias Burnus
- Re: target cost model tuning for x86
  - From: Dorit Nuzman
- Re: target cost model tuning for x86
  - From: Jan Hubicka
