This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Tweak signed division by power of two
- From: Roger Sayle <roger at eyesopen dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Tue, 13 Jul 2004 16:56:11 -0600 (MDT)
- Subject: Re: [PATCH] Tweak signed division by power of two
On Tue, 13 Jul 2004, Richard Henderson wrote:
> On Tue, Jul 13, 2004 at 11:48:46AM -0600, Roger Sayle wrote:
> > * expmed.c (expand_sdiv_pow2): New function to expand signed division
> > by a positive power of two, split out from expand_divmod. Provide
> > an alternate implementation when shifts are expensive. Lower the
> > threshold for using a branchless implementation to BRANCH_COST >= 2.
> > (expand_divmod): Call expand_sdiv_pow2 for suitable divisions.
>
> Hum. There's a lot of tweaking that could be done here.
>
> > + if (shift_cost[mode][ushift] > COSTS_N_INSNS (1))
>
> (1) AND with non-register operands is not necessarily cost 1.
> In particular, you'd want to consider the cost of loading
> the constant.
Indeed. As I mentioned in my e-mail, this is why I used
"> COSTS_N_INSNS (1)" rather than ">= COSTS_N_INSNS (1)": although
the RSHIFT and the AND both have immediate operands, most architectures
are guaranteed to handle the smaller operand of the shift in a single
instruction.
Ahh, looking at alpha.c's alpha_rtx_cost_data, I see the concern.
The EV4 has COSTS_N_INSNS (2) for a shift, and the EV5 has
COSTS_N_INSNS (1) + 1, so my "> COSTS_N_INSNS (1)" isn't equivalent
to ">= COSTS_N_INSNS (2)" as I'd assumed. Given that the Pentium4
has a shift cost of COSTS_N_INSNS(4), would you be happy changing
the above test to:
if (shift_cost[mode][ushift] >= COSTS_N_INSNS (2))
This achieves perfect selectivity between the (b) and (c) cases
in your e-mail. With this change, on alpha we'll always choose
the faster of (b) and (c), and when the latencies are equivalent,
we'll choose the shorter.
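For reference, the two branchless expansions being weighed against each
other correspond roughly to the C sketches below (hypothetical helper
names; assuming 32-bit int, a positive divisor d = 1 << k, and
arithmetic right shift of negative values, as on GCC targets):

```c
/* Branchless signed division by d = 1 << k, truncating toward zero.
   Two-shift variant: the all-ones/all-zeros sign word is shifted
   right logically so only the low k bits survive as the bias.  */
static int
sdiv_pow2_shifts (int x, int k)
{
  int bits = (int) sizeof (int) * 8;           /* assume 32 */
  int sign = x >> (bits - 1);                  /* 0 or -1 (arithmetic) */
  int bias = (int) ((unsigned) sign >> (bits - k));
  return (x + bias) >> k;
}

/* Shift-plus-AND variant: mask the sign word with the constant d - 1.
   Preferable when shifts cost two or more instructions, since the
   second shift is replaced by an AND with an immediate.  */
static int
sdiv_pow2_and (int x, int k)
{
  int bits = (int) sizeof (int) * 8;
  int bias = (x >> (bits - 1)) & ((1 << k) - 1);
  return (x + bias) >> k;
}
```

Both compute the same bias (d - 1 for negative x, 0 otherwise), so the
"shift_cost >= COSTS_N_INSNS (2)" test is simply choosing whichever way
of materializing that bias is cheaper on the target.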
> > + label = gen_label_rtx ();
> > + temp = copy_to_mode_reg (mode, op0);
> > + do_cmp_and_jump (temp, const0_rtx, GE, mode, label);
> > + expand_inc (temp, GEN_INT (d - 1));
> > + emit_label (label);
> > + return expand_shift (RSHIFT_EXPR, mode, temp, shift, NULL_RTX, 0);
>
> (2) This sequence is particularly good for conditional-move targets.
> For /1024, on alpha, we can chose from...
> Given equivalent latencies, I'd prefer the smallest alternative.
This is trickier. The huge BRANCH_COST of six means that we've always
generated a straight-line sequence as initial RTL, and I doubt that
if-conversion or combine is anywhere near clever enough to recover the
conditional move sequence.
I'll be happy to investigate a follow-up patch to check whether the
target has a suitable/cheap conditional move instruction, and use
your proposed (a) sequence instead. For the EV4, EV5 and EV6 timings
and sizes you gave, the conditional move is always preferable. The
clean-up of moving this functionality out of expand_divmod into the
new expand_sdiv_pow2 will make such experimentation much cleaner.
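In C terms, the conditional-move-friendly shape of that sequence would
look roughly like the sketch below (hypothetical helper name, assuming
division by d = 1 << k and a target where the ?: becomes a cmov rather
than a branch):

```c
/* Signed division by d = 1 << k using a select instead of a branch;
   on targets with a cheap conditional move this is straight-line
   code: add, compare, cmov, arithmetic shift.  */
static int
sdiv_pow2_cmov (int x, int k)
{
  int biased = x + ((1 << k) - 1);   /* speculatively add d - 1 */
  int t = (x < 0) ? biased : x;      /* select; no taken branch */
  return t >> k;                     /* arithmetic shift on GCC targets */
}
```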
With the proposed amendment above, this patch should now generate
alpha/x86 code at least as good as originally, and in the case of EV4
and Pentium4 even better than we were doing before.
Ok for mainline with the above change?
Roger
--