[patch] expmed.c: Speed up synth_mult further.
Kazu Hirata
kazu@codesourcery.com
Thu Sep 22 01:13:00 GMT 2005
Hi,
Attached is a patch to further speed up synth_mult.
First, note that we have four major ways to implement multiplication.
1. alg_shift
Left shifts. Only for even multipliers for obvious reasons.
2. alg_add_t_m2 and alg_sub_t_m2
Only for odd multipliers t. Transforms a * t into
a * (t - 1) + a or a * (t + 1) - a.
3. alg_add_factor and alg_sub_factor
For both odd and even multipliers t. Transforms a * t into
a * (2^m + 1) * t' or a * (2^m - 1) * t'
for some m >= 2.
4. alg_add_t2_m and alg_sub_t2_m
Only for odd multipliers t. Transforms a * t into
((a * (t - 1)/2^m) << m) + a or ((a * (t + 1)/2^m) << m) - a,
where 2^m is the largest power of two dividing t - 1 or t + 1,
respectively.
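As a concrete illustration, here is each of the four decompositions written out in C for small multipliers. The helper names are mine, purely illustrative; they do not appear in expmed.c.

```c
/* 1. alg_shift: a * 40 = (a * 5) << 3.  */
unsigned mul_by_40_shift (unsigned a) { return (a * 5) << 3; }

/* 2. alg_add_t_m2: for odd t = 45, a * 45 = a * 44 + a.  */
unsigned mul_by_45_t_m2 (unsigned a) { return a * 44 + a; }

/* 3. alg_add_factor: 45 = (2^2 + 1) * 9, so
   a * 45 = ((a * 9) << 2) + a * 9.  */
unsigned mul_by_45_factor (unsigned a)
{
  unsigned p = a * 9;
  return (p << 2) + p;
}

/* 4. alg_add_t2_m: 45 - 1 = 44 = 11 * 2^2, so
   a * 45 = ((a * 11) << 2) + a.  */
unsigned mul_by_45_t2_m (unsigned a) { return ((a * 11) << 2) + a; }
```

Each helper reduces the multiplication by 40 or 45 to a multiplication by a smaller constant (5, 44, 9 or 11), which synth_mult would then decompose recursively in the same way.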
Notice that if a given multiplier is even, we have two ways to break
down the multiplication: alg_shift or alg_{add,sub}_factor. Suppose we
pick alg_{add,sub}_factor. Since 2^m +/- 1 is odd for any m >= 2, the
remaining multiplier t / (2^m +/- 1) is still even, so we face the
same choice again. However many times we choose alg_{add,sub}_factor,
we must eventually take alg_shift: sooner or later we run out of
factors of the form 2^m +/- 1, while the multiplier stays even.
For example, given the multiplier 40, we could emit either
(a << 3) * 5 or (a * 5) << 3. These choices for even multipliers
increase the branching factor of the algorithm.
The patch solves this problem by using alg_shift exclusively for even
multipliers.
Here are compile times in seconds, using a cross compiler from
x86_64-pc-linux-gnu to alpha-linux-gnu on the testcase in PR 23971.

  original  patched  diff (%)
     0.685    0.175     -74.5  (wow!)
The diff is bigger than the change it really makes; most of the churn
is reindentation from the new else block.
The original code looks like
if (t is even)
{
try alg_shift.
}
if (t is odd)
{
try alg_add_t_m2 and alg_sub_t_m2.
}
try alg_add_factor and alg_sub_factor.
if (t is odd)
{
try alg_add_t2_m and alg_sub_t2_m.
}
With my patch, the code will look like
if (t is even)
{
try alg_shift.
}
else
{
try alg_add_t_m2 and alg_sub_t_m2.
try alg_add_factor and alg_sub_factor.
try alg_add_t2_m and alg_sub_t2_m.
}
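The effect of the restructuring on the branching factor can be sketched
with a toy node counter. This is a simplified model of my own (only
alg_shift, alg_{add,sub}_t_m2 and alg_{add,sub}_factor, no cost limit
and no caching), not the real search in expmed.c:

```c
/* Count the nodes visited while recursively decomposing multiplier t.
   When RESTRICTED is nonzero, an even t tries alg_shift exclusively,
   as in the patch; otherwise it tries both alg_shift and
   alg_{add,sub}_factor, as before.  */
unsigned long
count_nodes (unsigned long t, int restricted)
{
  unsigned long nodes = 1;
  int m;

  if (t <= 1)
    return nodes;

  if ((t & 1) == 0)
    {
      /* alg_shift.  One bit per step is enough for counting purposes;
         the real code strips all trailing zeros at once.  */
      nodes += count_nodes (t >> 1, restricted);
      if (restricted)
        return nodes;
    }
  else
    {
      /* alg_add_t_m2 and alg_sub_t_m2.  */
      nodes += count_nodes (t - 1, restricted);
      nodes += count_nodes (t + 1, restricted);
    }

  /* alg_add_factor and alg_sub_factor: t = (2^m +/- 1) * t'.  */
  for (m = 2; ((unsigned long) 1 << m) <= t; m++)
    {
      unsigned long d = ((unsigned long) 1 << m) + 1;
      if (t % d == 0 && t > d)
        nodes += count_nodes (t / d, restricted);
      d = ((unsigned long) 1 << m) - 1;
      if (t % d == 0 && t > d)
        nodes += count_nodes (t / d, restricted);
    }
  return nodes;
}
```

On this model the restricted search visits strictly fewer nodes for any
even multiplier with a factor of the form 2^m +/- 1, such as
40 = 8 * 5, because the alg_{add,sub}_factor subtrees for even t are
never explored.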
Tested on x86_64-pc-linux-gnu. OK to apply?
Kazu Hirata
2005-09-21 Kazu Hirata <kazu@codesourcery.com>
* expmed.c (synth_mult): Speed up by using alg_shift
exclusively for even multipliers.
Index: expmed.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/expmed.c,v
retrieving revision 1.235
diff -c -d -p -r1.235 expmed.c
*** expmed.c 21 Sep 2005 16:32:10 -0000 1.235
--- expmed.c 21 Sep 2005 17:05:04 -0000
*************** synth_mult (struct algorithm *alg_out, u
*** 2566,2577 ****
if (cache_hit)
goto done;
}
!
! /* If we have an odd number, add or subtract one. */
! if ((t & 1) != 0)
{
unsigned HOST_WIDE_INT w;
do_alg_addsub_t_m2:
for (w = 1; (w & t) != 0; w <<= 1)
;
--- 2566,2576 ----
if (cache_hit)
goto done;
}
! else
{
unsigned HOST_WIDE_INT w;
+ /* If we have an odd number, add or subtract one. */
do_alg_addsub_t_m2:
for (w = 1; (w & t) != 0; w <<= 1)
;
*************** synth_mult (struct algorithm *alg_out, u
*** 2626,2732 ****
}
if (cache_hit)
goto done;
- }
-
- /* Look for factors of t of the form
- t = q(2**m +- 1), 2 <= m <= floor(log2(t - 1)).
- If we find such a factor, we can multiply by t using an algorithm that
- multiplies by q, shift the result by m and add/subtract it to itself.
! We search for large factors first and loop down, even if large factors
! are less probable than small; if we find a large factor we will find a
! good sequence quickly, and therefore be able to prune (by decreasing
! COST_LIMIT) the search. */
! do_alg_addsub_factor:
! for (m = floor_log2 (t - 1); m >= 2; m--)
! {
! unsigned HOST_WIDE_INT d;
! d = ((unsigned HOST_WIDE_INT) 1 << m) + 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_add_factor))
{
! /* If the target has a cheap shift-and-add instruction use
! that in preference to a shift insn followed by an add insn.
! Assume that the shift-and-add is "atomic" with a latency
! equal to its cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftadd_cost[mode][m] < op_cost)
{
! op_cost = shiftadd_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_add_factor;
}
- /* Other factors will have been taken care of in the recursion. */
- break;
- }
! d = ((unsigned HOST_WIDE_INT) 1 << m) - 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_sub_factor))
! {
! /* If the target has a cheap shift-and-subtract insn use
! that in preference to a shift insn followed by a sub insn.
! Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftsub_cost[mode][m] < op_cost)
{
! op_cost = shiftsub_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_sub_factor;
}
- break;
}
! }
! if (cache_hit)
! goto done;
! /* Try shift-and-add (load effective address) instructions,
! i.e. do a*3, a*5, a*9. */
! if ((t & 1) != 0)
! {
do_alg_add_t2_m:
q = t - 1;
q = q & -q;
--- 2625,2728 ----
}
if (cache_hit)
goto done;
! /* Look for factors of t of the form
! t = q(2**m +- 1), 2 <= m <= floor(log2(t - 1)).
! If we find such a factor, we can multiply by t using an algorithm that
! multiplies by q, shift the result by m and add/subtract it to itself.
! We search for large factors first and loop down, even if large factors
! are less probable than small; if we find a large factor we will find a
! good sequence quickly, and therefore be able to prune (by decreasing
! COST_LIMIT) the search. */
! do_alg_addsub_factor:
! for (m = floor_log2 (t - 1); m >= 2; m--)
{
! unsigned HOST_WIDE_INT d;
!
! d = ((unsigned HOST_WIDE_INT) 1 << m) + 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_add_factor))
{
! /* If the target has a cheap shift-and-add instruction use
! that in preference to a shift insn followed by an add insn.
! Assume that the shift-and-add is "atomic" with a latency
! equal to its cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftadd_cost[mode][m] < op_cost)
! {
! op_cost = shiftadd_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_add_factor;
! }
! /* Other factors will have been taken care of in the recursion. */
! break;
}
! d = ((unsigned HOST_WIDE_INT) 1 << m) - 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_sub_factor))
{
! /* If the target has a cheap shift-and-subtract insn use
! that in preference to a shift insn followed by a sub insn.
! Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftsub_cost[mode][m] < op_cost)
! {
! op_cost = shiftsub_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_sub_factor;
! }
! break;
}
}
! if (cache_hit)
! goto done;
! /* Try shift-and-add (load effective address) instructions,
! i.e. do a*3, a*5, a*9. */
do_alg_add_t2_m:
q = t - 1;
q = q & -q;