This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[4.0 and mainline] Fix multiplication by constant expansion
- From: Jan Hubicka <jh at suse dot cz>
- To: gcc-patches at gcc dot gnu dot org, rth at redhat dot com
- Date: Sun, 1 Jan 2006 21:39:18 +0100
- Subject: [4.0 and mainline] Fix multiplication by constant expansion
Hi,
currently with -march=athlon we expand multiplication by 11 as
leal 0(,%edx,4), %ecx
movl %edx, %eax
sall $4, %eax
subl %ecx, %eax
subl %edx, %eax
that is both longer and slower than imul. Better equivalent would be:
leal (%edx,%edx,4), %eax
leal (%edx,%eax,2), %eax
that we sucesfully produce for -march=i686. The problem is actually in
the generic expander that assume latency of rule:
alg_add_t2_m total := total * coeff + multiplicand
to be equivalent of add latency claiming that shift and add can execute
in parallel, that is wrong, since there is dependency chain.
The patch fixes that but leaves somewhat absurd case where we always
have same latency and cost. This is correct because the algorithm as
done currently never introduce any parallelism. It theoretically can
with:
alg_add_t_m2 total := total + multiplicand * coeff;
but it never use it with coeff != 1. I am going to try this adding
non-1 coefficients to see if it leads to anything good. Athlon is
however example where shift-and-add instruction behave atomically with
latency of 2 so it won't introduce parallelism anyway.
Perhaps it would make sense to add specialized imul expander for i386 if
it can do signficantly better, I will try this out too.
The patch was bootstrapped/regtested i686-linux and it speeds up
sixtrack by over 10% with -march=athlon as well as few other bechmarks
on K8. I also benchmarked P4 and saw no regressions there.
Thanks to H. J. Lu who originally pointed out the problem and Zdenek who
figured out the testcase! OK for mainline and branch?
/* { dg-do compile { target i?86-*-* } } */
/* { dg-options "-O2 -march=athlon" } */
/* Multiplication by 11 should expand as sequence of two leas. */
/* { dg-final { scan-assembler "lea" } } */
/* { dg-final { scan-assembler-not "sub" } } */
/* { dg-final { scan-assembler-not "add" } } */
/* { dg-final { scan-assembler-not "neg" } } */
int a(int b)
{
return b*11;
}
2006-01-01 Zdenek Dvorak <dvorakz@suse.cz>
Jan Hubicka <jh@suse.cz>
H.J. Lu <hongjiu.lu@intel.com>
* expmed.c (synth_mult): Fix computing of latency.
Index: expmed.c
===================================================================
*** expmed.c (revision 108519)
--- expmed.c (working copy)
*************** synth_mult (struct algorithm *alg_out, u
*** 2657,2676 ****
/* If the target has a cheap shift-and-add instruction use
that in preference to a shift insn followed by an add insn.
Assume that the shift-and-add is "atomic" with a latency
! equal to its cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
op_cost = add_cost[mode] + shift_cost[mode][m];
if (shiftadd_cost[mode][m] < op_cost)
! {
! op_cost = shiftadd_cost[mode][m];
! op_latency = op_cost;
! }
else
op_latency = add_cost[mode];
new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, t / d, &new_limit, mode);
alg_in->cost.cost += op_cost;
--- 2657,2671 ----
/* If the target has a cheap shift-and-add instruction use
that in preference to a shift insn followed by an add insn.
Assume that the shift-and-add is "atomic" with a latency
! equal to its cost */
op_cost = add_cost[mode] + shift_cost[mode][m];
if (shiftadd_cost[mode][m] < op_cost)
! op_cost = shiftadd_cost[mode][m];
else
op_latency = add_cost[mode];
new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_cost;
synth_mult (alg_in, t / d, &new_limit, mode);
alg_in->cost.cost += op_cost;
*************** synth_mult (struct algorithm *alg_out, u
*** 2696,2715 ****
/* If the target has a cheap shift-and-subtract insn use
that in preference to a shift insn followed by a sub insn.
Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
op_cost = add_cost[mode] + shift_cost[mode][m];
if (shiftsub_cost[mode][m] < op_cost)
! {
! op_cost = shiftsub_cost[mode][m];
! op_latency = op_cost;
! }
else
! op_latency = add_cost[mode];
new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, t / d, &new_limit, mode);
alg_in->cost.cost += op_cost;
--- 2691,2705 ----
/* If the target has a cheap shift-and-subtract insn use
that in preference to a shift insn followed by a sub insn.
Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost */
op_cost = add_cost[mode] + shift_cost[mode][m];
if (shiftsub_cost[mode][m] < op_cost)
! op_cost = shiftsub_cost[mode][m];
else
! op_latency = op_cost;
new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_cost;
synth_mult (alg_in, t / d, &new_limit, mode);
alg_in->cost.cost += op_cost;
*************** synth_mult (struct algorithm *alg_out, u
*** 2742,2753 ****
{
op_cost = shiftadd_cost[mode][m];
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_cost;
synth_mult (alg_in, (t - 1) >> m, &new_limit, mode);
alg_in->cost.cost += op_cost;
alg_in->cost.latency += op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
{
struct algorithm *x;
best_cost = alg_in->cost;
--- 2732,2759 ----
{
op_cost = shiftadd_cost[mode][m];
new_limit.cost = best_cost.cost - op_cost;
+ /* Because we want to preffer combined lea over disjoint
+ operations, we are interested in sollutions having equal
+ cost to best one we found so far. */
+ if (best_cost.cost != cost_limit->cost && m)
+ new_limit.cost++;
new_limit.latency = best_cost.latency - op_cost;
+ if (best_cost.latency != cost_limit->latency && m)
+ new_limit.latency++;
synth_mult (alg_in, (t - 1) >> m, &new_limit, mode);
alg_in->cost.cost += op_cost;
alg_in->cost.latency += op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost)
! /* When the costs are equivalent, still preffer to use
! shiftadd. This should reduce total number of instructions
! produced and register presure since intermediate result
! of shift can't be reused later.
! This help (for instance) some i386 CPUs where lea cost is
! equivalent to cost of disjoint operations latency and cost
! wise, while it still is shorter to encode and easier to
! register allocate. */
! || (m && !CHEAPER_MULT_COST (&best_cost, &alg_in->cost)))
{
struct algorithm *x;
best_cost = alg_in->cost;