This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [4.0 and mainline] Fix multiplication by constant expansion
> There appear to be two things you're misunderstanding. The first
> is that as far as the backend is concerned, all synthetic multiplication
> sequences are selected by the middle-end based upon cost. The term
> "latency" is a theoretical/hypothetical value used by synth_mult to
> split ties between sequences of the same cost. One possible
I get the overall idea of latency here. I was trying to argue however
that the hypotetical value computed is not reflecting any real benefits,
just leading the expansion code to chose "interesting" sollutions of
same overall cost.
It seems to me that expmult should have priorities such as
1) produce minimal cost
2) produce minimal latency
3) try to avoid minus and other operations that won't optimize into
address arithmetic
Actually it seems to me that 1/2 should be swapped for -O2, hot blocks
and superscalar targets, but since latency definition is bit slipperly,
it is probably not good idea in current scheme.
> Secondly, is understanding how "latency" is calculated. You are
> quite right that in your example above:
>
> tmp = (tmp << 5) + multiplicant
>
> the latency is min(shift_cost+add_cost, shiftadd_cost) + latency (tmp)
>
> The concurrency spotted in synth_mult is for the other use of
> shift_add instructions:
>
> tmp = (multiplicant << 5) + tmp
>
> where latency may be max(shift_cost,latency(tmp)) + add_cost
As i get the rules:
alg_add_t_m2 total := total + multiplicand * coeff;
alg_sub_t_m2 total := total - multiplicand * coeff;
alg_add_factor total := total * coeff + total;
alg_sub_factor total := total * coeff - total;
alg_add_t2_m total := total * coeff + multiplicand;
alg_sub_t2_m total := total * coeff - multiplicand;
the example you show is reflecting alg_add_t_m2 rather than
alg_add_factor I changed latency of. alg_add_factor can execute in
parallel iff the total*coeff can compute in parallel with last
computation of "total" (ie the parallelism introduced by CSE), that is
not really guarranteed to happen, if alg_sub_factor is first in sequence
or preceeded by something that won't combine.
>
> i.e. in this situation the multiplicant may be shifted in parallel
> with the calculation of "tmp". Please refer to the original patch
> descriptions to see why one is considered better than the other.
>
>
> Of course, latency calculation has absolutely nothing to do with your
> problem that you want a shift followed by an addition to be more expensive
> than an lea. In fact, you can remove the latency calculation altogether
> and simply reorder the recursive search to attempt lea before
> shift-and-add (or use <= instead of < in the code), to reintroduce the
> same non-determinism.
>
>
> Please just try the timings. If a simple one or two line tweak, gets
> you 10% (or more) on six-track, that is the best possible outcome.
Actually I tried it (it was first incarnation of fix) and it brings some
regressions too. It is not good to lie to expander that "lea" is as
cheap as atomic operation since it ends up producing many leas. Instead
I use the following patch I was about to send once testing finishes that
makes just one cost point cheaper (not whole cycle), so it is still
preferred over two discrete operations, but just by very tiny bit. This
also hanle the *11 sollution and I also re-implemented DECOMPSE_LEA flag
that was previously disabled because of bug and this particular problem.
I tested it on P4 too and didn't measured slowdowns (nor speedups off
noise however).
Does this patch look acceptable? (I can also do just the "-1" trick and
skip the DECOMPSE_LEA if it seems riskant for you, especially in 4.1)
Honza
* i386.c (lea_cost): New function.
(ix86_rtx_costs): Use it.
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c (revision 108713)
--- config/i386/i386.c (working copy)
*************** ix86_memory_move_cost (enum machine_mode
*** 16256,16261 ****
--- 16258,16288 ----
}
}
+ /* Return cost of LEA instruction replacing non-trivial arithmetic sequence
+ of overall cost ARITHMETIC_COST.
+
+ When the cost of lea is equivalent to cost of discrete seuqence, reduce the
+ cost tiny bit. This instruct optimizers to chose sequences that combine into
+ shorter and easier to register allocate leas. For example seuqnece for
+ multiplication by 11 on Athlon gets combined into two leas instead of lea
+ followed by shift and sub having overall same latency but being more
+ expensive. */
+ static int
+ lea_cost (int arithmetic_cost)
+ {
+ int cost = arithmetic_cost;
+
+ /* Some targets (P3/P4) decompose lea instruction internally into primitive
+ instructions so the latency depends on complexity of arithmetic presented.
+ */
+ if (!TARGET_DECOMPOSE_LEA)
+ cost = COSTS_N_INSNS (ix86_cost->lea);
+
+ if (arithmetic_cost == cost)
+ cost--;
+ return cost;
+ }
+
/* Compute a (partial) cost for rtx X. Return true if the complete
cost has been computed, and false if subexpressions should be
scanned. In either case, *TOTAL contains the cost result. */
*************** ix86_rtx_costs (rtx x, int code, int out
*** 16335,16340 ****
--- 16362,16368 ----
return false;
}
if ((value == 2 || value == 3)
+ && !TARGET_DECOMPOSE_LEA
&& ix86_cost->lea <= ix86_cost->shift_const)
{
*total = COSTS_N_INSNS (ix86_cost->lea);
*************** ix86_rtx_costs (rtx x, int code, int out
*** 16448,16454 ****
HOST_WIDE_INT val = INTVAL (XEXP (XEXP (XEXP (x, 0), 0), 1));
if (val == 2 || val == 4 || val == 8)
{
! *total = COSTS_N_INSNS (ix86_cost->lea);
*total += rtx_cost (XEXP (XEXP (x, 0), 1), outer_code);
*total += rtx_cost (XEXP (XEXP (XEXP (x, 0), 0), 0),
outer_code);
--- 16476,16483 ----
HOST_WIDE_INT val = INTVAL (XEXP (XEXP (XEXP (x, 0), 0), 1));
if (val == 2 || val == 4 || val == 8)
{
! *total = lea_cost (COSTS_N_INSNS (ix86_cost->add) * 2
! + COSTS_N_INSNS (ix86_cost->shift_const));
*total += rtx_cost (XEXP (XEXP (x, 0), 1), outer_code);
*total += rtx_cost (XEXP (XEXP (XEXP (x, 0), 0), 0),
outer_code);
*************** ix86_rtx_costs (rtx x, int code, int out
*** 16462,16468 ****
HOST_WIDE_INT val = INTVAL (XEXP (XEXP (x, 0), 1));
if (val == 2 || val == 4 || val == 8)
{
! *total = COSTS_N_INSNS (ix86_cost->lea);
*total += rtx_cost (XEXP (XEXP (x, 0), 0), outer_code);
*total += rtx_cost (XEXP (x, 1), outer_code);
return true;
--- 16491,16498 ----
HOST_WIDE_INT val = INTVAL (XEXP (XEXP (x, 0), 1));
if (val == 2 || val == 4 || val == 8)
{
! *total = lea_cost (COSTS_N_INSNS (ix86_cost->add)
! + COSTS_N_INSNS (ix86_cost->shift_const));
*total += rtx_cost (XEXP (XEXP (x, 0), 0), outer_code);
*total += rtx_cost (XEXP (x, 1), outer_code);
return true;
*************** ix86_rtx_costs (rtx x, int code, int out
*** 16470,16476 ****
}
else if (GET_CODE (XEXP (x, 0)) == PLUS)
{
! *total = COSTS_N_INSNS (ix86_cost->lea);
*total += rtx_cost (XEXP (XEXP (x, 0), 0), outer_code);
*total += rtx_cost (XEXP (XEXP (x, 0), 1), outer_code);
*total += rtx_cost (XEXP (x, 1), outer_code);
--- 16500,16506 ----
}
else if (GET_CODE (XEXP (x, 0)) == PLUS)
{
! *total = lea_cost (COSTS_N_INSNS (ix86_cost->add) * 2);
*total += rtx_cost (XEXP (XEXP (x, 0), 0), outer_code);
*total += rtx_cost (XEXP (XEXP (x, 0), 1), outer_code);
*total += rtx_cost (XEXP (x, 1), outer_code);