This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[Committed] PR17236: Improve double-word multiply RTL expansion
- From: Roger Sayle <roger at eyesopen dot com>
- To: gcc-patches at gcc dot gnu dot org
- Date: Mon, 14 Mar 2005 11:24:35 -0700 (MST)
- Subject: [Committed] PR17236: Improve double-word multiply RTL expansion
The following patch (partially) resolves PR rtl-optimization/17236 which
is an inefficiency in the double-word multiply sequences synthesized by
GCC. Performing a double-word multiply typically requires a minimum of
one widenening unsigned multiply, two non-widening multiplies and three
additions. Whilst the sequences generated by GCC's expand_binop contain
this minimum number of operations, the order in which they are performed
results in poor register allocation.
Consider GCC's current sequence, given the two operands, op0 and op1,
which are held in four registers op0_high, op0_low, op1_high and op1_low,
we need to produce the product, which is held in two registers,
"product_high" and "product_low".
Mainline CVS currently generates:
// live
{product_high,product_low} = widen_mult(op0_low,op1_low) // 6
temp = op0_high * op1_low; // 5
product_high += temp; // 4
temp = op1_high * op0_low; // 3
product_high += temp; // 2
using the minimum number of operations. However, now consider the
register lifetimes in this example, imediately after widening multiply
six live registers. The values of op0_high, op0_low, op1_high, op1_low,
product_high and product_low are all live simultaneously.
A better implementation is to perform these exact same operations but
in a slightly different order.
// live
adjust = op0_high * op1_low; // op0_high is now dead // 4
temp = op1_high * op0_low; // op1_high is now dead // 4
adjust += temp; // // 3
{product_high,product_low} = widen_mult(op0_low,op1_low) // 3
product_high += adjust;
Notice with this permuted order, we perform exactly the same arithmetic
operations but only require a maximum of 4 live registers, vs. the 6
required by the original sequence. On register starved architectures
such as the x86 this is a big win, but it should also improve register
pressure on other targets.
To demonstrate the improvement we get with -O2 -fomit-frame-pointer:
unsigned long long foo(unsigned long long a, unsigned long long b)
{
return a * b;
}
Before: pushl %ebp
pushl %edi
pushl %ebx
subl $8, %esp
movl 32(%esp), %ecx
movl 36(%esp), %ebx
movl 24(%esp), %eax
mull %ecx
movl %eax, (%esp)
movl 24(%esp), %edi
imull %ebx, %edi
movl %edx, %ebp
addl %edi, %ebp
imull 28(%esp), %ecx
leal (%ebp,%ecx), %ecx
movl %ecx, 4(%esp)
movl (%esp), %eax
movl 4(%esp), %edx
addl $8, %esp
popl %ebx
popl %edi
popl %ebp
ret
After: pushl %edi
pushl %esi
pushl %ebx
movl 16(%esp), %eax
movl 20(%esp), %edx
movl 24(%esp), %ecx
movl 28(%esp), %ebx
movl %edx, %esi
imull %ecx, %esi
movl %ebx, %edi
imull %eax, %edi
addl %edi, %esi
mull %ecx
leal (%esi,%edx), %edx
popl %ebx
popl %esi
popl %edi
ret
Notice that by performing the "mull" after the "imuls", we no longer need
to spill registers to memory. Unfortunately, reload and the i386 backend
still conspire against us, and use six registers instead of the minimum
possible four, so we're still not quite as efficient as the Intel's icc
compiler, but the problem is now a generic reload defficiency rather than
also an RTL expansion problem. [I may have an i386.md specific fix...]
The one minor technical difficulty is that calculating "adjust" before
we call the widening multiply requires that we know in advance whether
we're using a signed widening multiply or an unsigned widening multiply.
[This is probably the reason the current code uses the order it does].
The solution is to split out a new subroutine "expand_doubleword_mult"
from "expand_binop", that we can call once to attempt using an unsigned
widening multiply, and if that fails clean-up and try again with a
signed widening multiply. Breaking this subroutine out of "expand_binop"
is probably a worthwhile clean-up in its own right. I decided not to
make any additional changes with this patch, to avoid breaking more
than I have to, but it may be possible to improve code generation for
synthetic doubleword multiplies on some platforms by using the
emit_store_flag function to create a signmask and/or signbit.
The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.
Committed to mainline CVS.
2005-03-14 Roger Sayle <roger@eyesopen.com>
PR rtl-optimization/17236
* optabs.c (expand_doubleword_mult): New helper function split out
from expand_binop. Permute the order in which instructions are
emitted to minimize the number of simultaneously live registers.
(expand_binop): Call expand_doubleword_mult to synthesize a double
word multiplication.
Index: optabs.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/optabs.c,v
retrieving revision 1.260
diff -c -3 -p -r1.260 optabs.c
*** optabs.c 12 Feb 2005 11:34:21 -0000 1.260
--- optabs.c 14 Mar 2005 14:37:40 -0000
*************** expand_doubleword_shift (enum machine_mo
*** 756,761 ****
--- 756,923 ----
return true;
}
+ /* Subroutine of expand_binop. Perform a double word multiplication of
+ operands OP0 and OP1 both of mode MODE, which is exactly twice as wide
+ as the target's word_mode. This function return NULL_RTX if anything
+ goes wrong, in which case it may have already emitted instructions
+ which need to be deleted.
+
+ If we want to multiply two two-word values and have normal and widening
+ multiplies of single-word values, we can do this with three smaller
+ multiplications. Note that we do not make a REG_NO_CONFLICT block here
+ because we are not operating on one word at a time.
+
+ The multiplication proceeds as follows:
+ _______________________
+ [__op0_high_|__op0_low__]
+ _______________________
+ * [__op1_high_|__op1_low__]
+ _______________________________________________
+ _______________________
+ (1) [__op0_low__*__op1_low__]
+ _______________________
+ (2a) [__op0_low__*__op1_high_]
+ _______________________
+ (2b) [__op0_high_*__op1_low__]
+ _______________________
+ (3) [__op0_high_*__op1_high_]
+
+
+ This gives a 4-word result. Since we are only interested in the
+ lower 2 words, partial result (3) and the upper words of (2a) and
+ (2b) don't need to be calculated. Hence (2a) and (2b) can be
+ calculated using non-widening multiplication.
+
+ (1), however, needs to be calculated with an unsigned widening
+ multiplication. If this operation is not directly supported we
+ try using a signed widening multiplication and adjust the result.
+ This adjustment works as follows:
+
+ If both operands are positive then no adjustment is needed.
+
+ If the operands have different signs, for example op0_low < 0 and
+ op1_low >= 0, the instruction treats the most significant bit of
+ op0_low as a sign bit instead of a bit with significance
+ 2**(BITS_PER_WORD-1), i.e. the instruction multiplies op1_low
+ with 2**BITS_PER_WORD - op0_low, and two's complements the
+ result. Conclusion: We need to add op1_low * 2**BITS_PER_WORD to
+ the result.
+
+ Similarly, if both operands are negative, we need to add
+ (op0_low + op1_low) * 2**BITS_PER_WORD.
+
+ We use a trick to adjust quickly. We logically shift op0_low right
+ (op1_low) BITS_PER_WORD-1 steps to get 0 or 1, and add this to
+ op0_high (op1_high) before it is used to calculate 2b (2a). If no
+ logical shift exists, we do an arithmetic right shift and subtract
+ the 0 or -1. */
+
+ static rtx
+ expand_doubleword_mult (enum machine_mode mode, rtx op0, rtx op1, rtx target,
+ bool umulp, enum optab_methods methods)
+ {
+ int low = (WORDS_BIG_ENDIAN ? 1 : 0);
+ int high = (WORDS_BIG_ENDIAN ? 0 : 1);
+ rtx wordm1 = umulp ? NULL_RTX : GEN_INT (BITS_PER_WORD - 1);
+ rtx product, adjust, product_high, temp;
+
+ rtx op0_high = operand_subword_force (op0, high, mode);
+ rtx op0_low = operand_subword_force (op0, low, mode);
+ rtx op1_high = operand_subword_force (op1, high, mode);
+ rtx op1_low = operand_subword_force (op1, low, mode);
+
+ /* If we're using an unsigned multiply to directly compute the product
+ of the low-order words of the operands and perform any required
+ adjustments of the operands, we begin by trying two more multiplications
+ and then computing the appropriate sum.
+
+ We have checked above that the required addition is provided.
+ Full-word addition will normally always succeed, especially if
+ it is provided at all, so we don't worry about its failure. The
+ multiplication may well fail, however, so we do handle that. */
+
+ if (!umulp)
+ {
+ /* ??? This could be done with emit_store_flag where available. */
+ temp = expand_binop (word_mode, lshr_optab, op0_low, wordm1,
+ NULL_RTX, 1, methods);
+ if (temp)
+ op0_high = expand_binop (word_mode, add_optab, op0_high, temp,
+ op0_high, 0, OPTAB_DIRECT);
+ else
+ {
+ temp = expand_binop (word_mode, ashr_optab, op0_low, wordm1,
+ NULL_RTX, 0, methods);
+ if (!temp)
+ return NULL_RTX;
+ op0_high = expand_binop (word_mode, sub_optab, op0_high, temp,
+ op0_high, 0, OPTAB_DIRECT);
+ }
+
+ if (!op0_high)
+ return NULL_RTX;
+ }
+
+ adjust = expand_binop (word_mode, smul_optab, op0_high, op1_low,
+ NULL_RTX, 0, OPTAB_DIRECT);
+ if (!adjust)
+ return NULL_RTX;
+
+ /* OP0_HIGH should now be dead. */
+
+ if (!umulp)
+ {
+ /* ??? This could be done with emit_store_flag where available. */
+ temp = expand_binop (word_mode, lshr_optab, op1_low, wordm1,
+ NULL_RTX, 1, methods);
+ if (temp)
+ op1_high = expand_binop (word_mode, add_optab, op1_high, temp,
+ op1_high, 0, OPTAB_DIRECT);
+ else
+ {
+ temp = expand_binop (word_mode, ashr_optab, op1_low, wordm1,
+ NULL_RTX, 0, methods);
+ if (!temp)
+ return NULL_RTX;
+ op1_high = expand_binop (word_mode, sub_optab, op1_high, temp,
+ op1_high, 0, OPTAB_DIRECT);
+ }
+
+ if (!op1_high)
+ return NULL_RTX;
+ }
+
+ temp = expand_binop (word_mode, smul_optab, op1_high, op0_low,
+ NULL_RTX, 0, OPTAB_DIRECT);
+ if (!temp)
+ return NULL_RTX;
+
+ /* OP1_HIGH should now be dead. */
+
+ adjust = expand_binop (word_mode, add_optab, adjust, temp,
+ adjust, 0, OPTAB_DIRECT);
+
+ if (target && !REG_P (target))
+ target = NULL_RTX;
+
+ if (umulp)
+ product = expand_binop (mode, umul_widen_optab, op0_low, op1_low,
+ target, 1, OPTAB_DIRECT);
+ else
+ product = expand_binop (mode, smul_widen_optab, op0_low, op1_low,
+ target, 1, OPTAB_DIRECT);
+
+ if (!product)
+ return NULL_RTX;
+
+ product_high = operand_subword (product, high, 1, mode);
+ adjust = expand_binop (word_mode, add_optab, product_high, adjust,
+ REG_P (product_high) ? product_high : adjust,
+ 0, OPTAB_DIRECT);
+ emit_move_insn (product_high, adjust);
+ return product;
+ }
+
/* Wrapper around expand_binop which takes an rtx code to specify
the operation to perform, not an optab pointer. All other
arguments are the same. */
*************** expand_binop (enum machine_mode mode, op
*** 1399,1595 ****
delete_insns_since (last);
}
! /* If we want to multiply two two-word values and have normal and widening
! multiplies of single-word values, we can do this with three smaller
! multiplications. Note that we do not make a REG_NO_CONFLICT block here
! because we are not operating on one word at a time.
!
! The multiplication proceeds as follows:
! _______________________
! [__op0_high_|__op0_low__]
! _______________________
! * [__op1_high_|__op1_low__]
! _______________________________________________
! _______________________
! (1) [__op0_low__*__op1_low__]
! _______________________
! (2a) [__op0_low__*__op1_high_]
! _______________________
! (2b) [__op0_high_*__op1_low__]
! _______________________
! (3) [__op0_high_*__op1_high_]
!
!
! This gives a 4-word result. Since we are only interested in the
! lower 2 words, partial result (3) and the upper words of (2a) and
! (2b) don't need to be calculated. Hence (2a) and (2b) can be
! calculated using non-widening multiplication.
!
! (1), however, needs to be calculated with an unsigned widening
! multiplication. If this operation is not directly supported we
! try using a signed widening multiplication and adjust the result.
! This adjustment works as follows:
!
! If both operands are positive then no adjustment is needed.
!
! If the operands have different signs, for example op0_low < 0 and
! op1_low >= 0, the instruction treats the most significant bit of
! op0_low as a sign bit instead of a bit with significance
! 2**(BITS_PER_WORD-1), i.e. the instruction multiplies op1_low
! with 2**BITS_PER_WORD - op0_low, and two's complements the
! result. Conclusion: We need to add op1_low * 2**BITS_PER_WORD to
! the result.
!
! Similarly, if both operands are negative, we need to add
! (op0_low + op1_low) * 2**BITS_PER_WORD.
!
! We use a trick to adjust quickly. We logically shift op0_low right
! (op1_low) BITS_PER_WORD-1 steps to get 0 or 1, and add this to
! op0_high (op1_high) before it is used to calculate 2b (2a). If no
! logical shift exists, we do an arithmetic right shift and subtract
! the 0 or -1. */
if (binoptab == smul_optab
&& class == MODE_INT
&& GET_MODE_SIZE (mode) == 2 * UNITS_PER_WORD
&& smul_optab->handlers[(int) word_mode].insn_code != CODE_FOR_nothing
! && add_optab->handlers[(int) word_mode].insn_code != CODE_FOR_nothing
! && ((umul_widen_optab->handlers[(int) mode].insn_code
! != CODE_FOR_nothing)
! || (smul_widen_optab->handlers[(int) mode].insn_code
! != CODE_FOR_nothing)))
! {
! int low = (WORDS_BIG_ENDIAN ? 1 : 0);
! int high = (WORDS_BIG_ENDIAN ? 0 : 1);
! rtx op0_high = operand_subword_force (op0, high, mode);
! rtx op0_low = operand_subword_force (op0, low, mode);
! rtx op1_high = operand_subword_force (op1, high, mode);
! rtx op1_low = operand_subword_force (op1, low, mode);
! rtx product = 0;
! rtx op0_xhigh = NULL_RTX;
! rtx op1_xhigh = NULL_RTX;
!
! /* If the target is the same as one of the inputs, don't use it. This
! prevents problems with the REG_EQUAL note. */
! if (target == op0 || target == op1
! || (target != 0 && !REG_P (target)))
! target = 0;
!
! /* Multiply the two lower words to get a double-word product.
! If unsigned widening multiplication is available, use that;
! otherwise use the signed form and compensate. */
! if (umul_widen_optab->handlers[(int) mode].insn_code != CODE_FOR_nothing)
{
! product = expand_binop (mode, umul_widen_optab, op0_low, op1_low,
! target, 1, OPTAB_DIRECT);
!
! /* If we didn't succeed, delete everything we did so far. */
! if (product == 0)
delete_insns_since (last);
- else
- op0_xhigh = op0_high, op1_xhigh = op1_high;
}
! if (product == 0
&& smul_widen_optab->handlers[(int) mode].insn_code
! != CODE_FOR_nothing)
{
! rtx wordm1 = GEN_INT (BITS_PER_WORD - 1);
! product = expand_binop (mode, smul_widen_optab, op0_low, op1_low,
! target, 1, OPTAB_DIRECT);
! op0_xhigh = expand_binop (word_mode, lshr_optab, op0_low, wordm1,
! NULL_RTX, 1, next_methods);
! if (op0_xhigh)
! op0_xhigh = expand_binop (word_mode, add_optab, op0_high,
! op0_xhigh, op0_xhigh, 0, next_methods);
! else
! {
! op0_xhigh = expand_binop (word_mode, ashr_optab, op0_low, wordm1,
! NULL_RTX, 0, next_methods);
! if (op0_xhigh)
! op0_xhigh = expand_binop (word_mode, sub_optab, op0_high,
! op0_xhigh, op0_xhigh, 0,
! next_methods);
! }
!
! op1_xhigh = expand_binop (word_mode, lshr_optab, op1_low, wordm1,
! NULL_RTX, 1, next_methods);
! if (op1_xhigh)
! op1_xhigh = expand_binop (word_mode, add_optab, op1_high,
! op1_xhigh, op1_xhigh, 0, next_methods);
! else
! {
! op1_xhigh = expand_binop (word_mode, ashr_optab, op1_low, wordm1,
! NULL_RTX, 0, next_methods);
! if (op1_xhigh)
! op1_xhigh = expand_binop (word_mode, sub_optab, op1_high,
! op1_xhigh, op1_xhigh, 0,
! next_methods);
! }
}
! /* If we have been able to directly compute the product of the
! low-order words of the operands and perform any required adjustments
! of the operands, we proceed by trying two more multiplications
! and then computing the appropriate sum.
!
! We have checked above that the required addition is provided.
! Full-word addition will normally always succeed, especially if
! it is provided at all, so we don't worry about its failure. The
! multiplication may well fail, however, so we do handle that. */
!
! if (product && op0_xhigh && op1_xhigh)
{
! rtx product_high = operand_subword (product, high, 1, mode);
! rtx temp = expand_binop (word_mode, binoptab, op0_low, op1_xhigh,
! NULL_RTX, 0, OPTAB_DIRECT);
!
! if (!REG_P (product_high))
! product_high = force_reg (word_mode, product_high);
!
! if (temp != 0)
! temp = expand_binop (word_mode, add_optab, temp, product_high,
! product_high, 0, next_methods);
!
! if (temp != 0 && temp != product_high)
! emit_move_insn (product_high, temp);
!
! if (temp != 0)
! temp = expand_binop (word_mode, binoptab, op1_low, op0_xhigh,
! NULL_RTX, 0, OPTAB_DIRECT);
!
! if (temp != 0)
! temp = expand_binop (word_mode, add_optab, temp,
! product_high, product_high,
! 0, next_methods);
!
! if (temp != 0 && temp != product_high)
! emit_move_insn (product_high, temp);
!
! emit_move_insn (operand_subword (product, high, 1, mode), product_high);
!
! if (temp != 0)
{
! if (mov_optab->handlers[(int) mode].insn_code != CODE_FOR_nothing)
! {
! temp = emit_move_insn (product, product);
! set_unique_reg_note (temp,
! REG_EQUAL,
! gen_rtx_fmt_ee (MULT, mode,
! copy_rtx (op0),
! copy_rtx (op1)));
! }
!
! return product;
}
}
-
- /* If we get here, we couldn't do it for some reason even though we
- originally thought we could. Delete anything we've emitted in
- trying to do it. */
-
- delete_insns_since (last);
}
/* It can't be open-coded in this mode.
--- 1561,1611 ----
delete_insns_since (last);
}
! /* Attempt to synthesize double word multiplies using a sequence of word
! mode multiplications. We first attempt to generate a sequence using a
! more efficient unsigned widening multiply, and if that fails we then
! try using a signed widening multiply. */
if (binoptab == smul_optab
&& class == MODE_INT
&& GET_MODE_SIZE (mode) == 2 * UNITS_PER_WORD
&& smul_optab->handlers[(int) word_mode].insn_code != CODE_FOR_nothing
! && add_optab->handlers[(int) word_mode].insn_code != CODE_FOR_nothing)
! {
! rtx product = NULL_RTX;
! if (umul_widen_optab->handlers[(int) mode].insn_code
! != CODE_FOR_nothing)
{
! product = expand_doubleword_mult (mode, op0, op1, target,
! true, methods);
! if (!product)
delete_insns_since (last);
}
! if (product == NULL_RTX
&& smul_widen_optab->handlers[(int) mode].insn_code
! != CODE_FOR_nothing)
{
! product = expand_doubleword_mult (mode, op0, op1, target,
! false, methods);
! if (!product)
! delete_insns_since (last);
}
! if (product != NULL_RTX)
{
! if (mov_optab->handlers[(int) mode].insn_code != CODE_FOR_nothing)
{
! temp = emit_move_insn (target ? target : product, product);
! set_unique_reg_note (temp,
! REG_EQUAL,
! gen_rtx_fmt_ee (MULT, mode,
! copy_rtx (op0),
! copy_rtx (op1)));
}
+ return product;
}
}
/* It can't be open-coded in this mode.
Roger
--
Roger Sayle, E-mail: roger@eyesopen.com
OpenEye Scientific Software, WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833