Compiling:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2)
{
  return arg1 * arg2;
}

with gcc -O2 -fomit-frame-pointer generates:

	pushl	%ebp
	pushl	%edi
	pushl	%ebx
	pushl	%edi
	pushl	%edi
	movl	32(%esp), %ecx
	movl	24(%esp), %eax
	movl	36(%esp), %ebx
	movl	24(%esp), %edi
	mull	%ecx
	imull	28(%esp), %ecx
	imull	%ebx, %edi
	movl	%edx, %ebp
	movl	%eax, (%esp)
	addl	%edi, %ebp
	movl	(%esp), %eax
	leal	(%ebp,%ecx), %ecx
	movl	%ecx, 4(%esp)
	movl	4(%esp), %edx
	popl	%ecx
	popl	%ebx
	popl	%ebx
	popl	%edi
	popl	%ebp
	ret

ICC -O2 generates much smaller code:

	movl	8(%esp), %ecx		#516.18
	imull	12(%esp), %ecx		#516.18
	movl	16(%esp), %eax		#516.18
	imull	4(%esp), %eax		#516.18
	addl	%eax, %ecx		#516.18
	movl	4(%esp), %eax		#516.18
	mull	12(%esp)		#516.18
	addl	%ecx, %edx		#516.18
	ret				#516.18
Confirmed. The code is now a little bit better:

LLM:
	pushl	%ebp
	pushl	%edi
	pushl	%ebx
	subl	$8, %esp
	movl	32(%esp), %ecx
	movl	36(%esp), %ebx
	movl	24(%esp), %eax
	mull	%ecx
	movl	%eax, (%esp)
	movl	24(%esp), %edi
	imull	%ebx, %edi
	movl	%edx, %ebp
	addl	%edi, %ebp
	imull	28(%esp), %ecx
	leal	(%ebp,%ecx), %ecx
	movl	%ecx, 4(%esp)
	movl	(%esp), %eax
	movl	4(%esp), %edx
	addl	$8, %esp
	popl	%ebx
	popl	%edi
	popl	%ebp
	ret
	.size	LLM, .-LLM
	.ident	"GCC: (GNU) 4.0.0 20041003 (experimental)"

On 2.95, we used to emit:

LLM:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	20(%esp),%ecx
	movl	24(%esp),%ebx
	movl	28(%esp),%eax
	movl	32(%esp),%esi
	mull	%ecx
	movl	%eax,%edi
	movl	28(%esp),%eax
	imull	%ecx,%esi
	imull	%ebx,%eax
	movl	%edx,%ebp
	addl	%esi,%ebp
	addl	%eax,%ebp
	movl	%edi,%eax
	movl	%ebp,%edx
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
.Lfe1:
	.size	LLM,.Lfe1-LLM
	.ident	"GCC: (GNU) 2.95.3 20010315 (release)"

Not a regression, but ICC does so much better.
Subject: Bug 17236

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	sayle@gcc.gnu.org	2005-03-14 18:24:15

Modified files:
	gcc            : ChangeLog optabs.c

Log message:
	PR rtl-optimization/17236
	* optabs.c (expand_doubleword_mult): New helper function split out
	from expand_binop.  Permute the order in which instructions are
	emitted to minimize the number of simultaneously live registers.
	(expand_binop): Call expand_doubleword_mult to synthesize a double
	word multiplication.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7846&r2=2.7847
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&r1=1.260&r2=1.261
Roger explains what else needs to be done here:
http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01386.html

Right now, after his patch, mainline generates this code:

	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	16(%esp), %eax
	movl	20(%esp), %edx
	movl	24(%esp), %ecx
	movl	28(%esp), %ebx
	movl	%edx, %esi
	imull	%ecx, %esi
	movl	%ebx, %edi
	imull	%eax, %edi
	addl	%edi, %esi
	mull	%ecx
	leal	(%esi,%edx), %edx
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
Uros made some additional comments:
http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01427.html
It looks like Ian's recent subreg lowering pass patch has improved code
generation on this testcase.  Previously, we'd spill three integer registers
to the stack for "LLM"; we're now down to two.  [A significant improvement
from the five we spilled when this bug was reported.]

Before:

LLM:
	subl	$12, %esp
	movl	%ebx, (%esp)
	movl	28(%esp), %edx
	movl	20(%esp), %ebx
	movl	16(%esp), %ecx
	movl	24(%esp), %eax
	movl	%esi, 4(%esp)
	movl	%edx, %esi
	movl	%edi, 8(%esp)
	movl	%ebx, %edi
	movl	(%esp), %ebx
	imull	%ecx, %esi
	imull	%eax, %edi
	mull	%ecx
	addl	%edi, %esi
	movl	8(%esp), %edi
	leal	(%esi,%edx), %edx
	movl	4(%esp), %esi
	addl	$12, %esp
	ret

After:

LLM:
	subl	$8, %esp
	movl	%ebx, (%esp)
	movl	20(%esp), %eax
	movl	%esi, 4(%esp)
	movl	24(%esp), %ecx
	movl	12(%esp), %esi
	movl	16(%esp), %ebx
	imull	%esi, %ecx
	imull	%eax, %ebx
	mull	%esi
	movl	4(%esp), %esi
	addl	%ebx, %ecx
	movl	(%esp), %ebx
	addl	$8, %esp
	leal	(%ecx,%edx), %edx
	ret
*** Bug 6585 has been marked as a duplicate of this bug. ***
The generated code has changed a lot recently, though it still uses two spills:

	pushl	%esi
	pushl	%ebx
	movl	12(%esp), %ebx	; load alow
	movl	20(%esp), %esi	; load blow
	movl	24(%esp), %ecx	; load bhigh
	imull	%ebx, %ecx	; bhigh = bhigh * alow
	movl	16(%esp), %eax	; load ahigh
	imull	%esi, %eax	; ahigh = ahigh * blow
	addl	%eax, %ecx
	movl	%esi, %eax	; need to move a low part in %eax
	mull	%ebx
	leal	(%ecx,%edx), %esi	; what the heck, a simple addl could do!
	movl	%esi, %edx
	popl	%ebx
	popl	%esi

The important thing is that the adds are one before, and one after the
multiply.  And the RTL before reload is very nice!

   28 r64:SI=[argp:SI]
   30 r66:SI=[argp:SI+0x8]
    7 {r62:SI=[argp:SI+0xc]*r64:SI;clobber flags:CC;}
    8 {r63:SI=[argp:SI+0x4]*r66:SI;clobber flags:CC;}
    9 {r62:SI=r62:SI+r63:SI;clobber flags:CC;}
   10 {r61:DI=zero_extend(r66:SI)*zero_extend(r64:SI);clobber flags:CC;}
   12 {r61:DI#4=r62:SI+r61:DI#4;clobber flags:CC;}

That could be basically:

	mov	(blow), %eax
	mov	(alow), %ecx
	imul	(bhigh), %eax, %ebx	; needs reload of (bh) in %ebx
	imul	(ahigh), %ecx, %edx	; needs reload of (ah) in %edx
	addl	%edx, %ebx
	mul	%ecx
	addl	%ebx, %edx

So, if we are content with a single spill, there is now mainly a register
allocation problem.  For zero spills, we would need rematerialization of
either alow or blow, as in:

	mov	(blow), %eax
	imul	(bhigh), %eax, %ecx	; needs reload of (bh) in %ecx
	imul	(ahigh), (alow), %edx	; needs reload of (ah) in %edx
	addl	%edx, %ecx
	mul	(al)
	addl	%ecx, %edx

So, for now let's aim at having one spill.  Then the main problem is that GCC
should try to keep one of alow or blow in %eax, in order to reuse it for the
mull.  Then it could do with a single spill, like this:

	pushl	%ebx
	movl	12(%esp), %eax	; load alow
	movl	20(%esp), %ecx	; load blow
	movl	24(%esp), %ebx	; load bhigh, input reload
	imull	%eax, %ebx
	movl	16(%esp), %edx	; load ahigh, input reload
	imull	%ecx, %edx
	addl	%edx, %ebx
	mul	%ecx
	addl	%ebx, %edx
	popl	%ebx
For -mregparm=2 the code is this:

	pushl	%ebx
	movl	8(%esp), %ebx
	movl	12(%esp), %ecx
	imull	%ebx, %edx
	imull	%eax, %ecx
	addl	%edx, %ecx
	mull	%ebx
	popl	%ebx
	leal	(%ecx,%edx), %edx
	ret

It is better only by luck: the low part of "a" is already in %eax, so that
regclass decides on the "right" preference.

  Register 64 pref AREG, else GENERAL_REGS

Paolo
Uros mentioned offlist that he wanted to hijack fwprop to always propagate stack slots into instructions. It would be a relatively useful piece of infrastructure to have a flag in MEMs that marks on-stack MEMs, because other MEMs definitely must not be propagated blindly.
> It would be a relatively useful piece of infrastructure to have a flag in
> MEMs that marks on-stack MEMs, because other MEMs definitely must not be
> propagated blindly.

Depending on your needs, MEM_NOTRAP_P could be a good approximation.
Generated code for a similar example is just plain stupid:

--cut here--
int test(long long a, long long b)
{
  return a * b;
}
--cut here--

gcc -O3:

test:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	movl	%esi, 4(%esp)
	movl	16(%ebp), %esi
	movl	%ebx, (%esp)
	movl	8(%ebp), %ebx
	movl	%esi, %eax
	movl	4(%esp), %esi
	mull	%ebx
	movl	(%esp), %ebx
	movl	%ebp, %esp
	popl	%ebp
	ret

gcc-4.0 is a little less creative and creates:

test:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	8(%ebp), %eax
	mull	16(%ebp)
	popl	%ebx
	popl	%esi
	popl	%edi
	leave
	ret
The problem in comment #11 is that GCC generates a widening multiply and
cannot remove the DImode operations until after register allocation (!).
While the root cause is a deficiency in RTL-level DCE, I suggest filing a
separate bug, because it could also be fixed at the tree level.
I think the tree level does the right thing; TER fixes this up, and
expand_expr is called with

  return (int) (b * a)

Later on expand_expr is called with

 <mult_expr 0x2aaaae9032c0
    type <integer_type 0x2aaaae937840 long long int DI
        size <integer_cst 0x2aaaae927c60 constant invariant 64>
        unit size <integer_cst 0x2aaaae927c90 constant invariant 8>
        align 64 symtab 0 alias set 2 canonical type 0x2aaaae937840
        precision 64
        min <integer_cst 0x2aaaae927bd0 -9223372036854775808>
        max <integer_cst 0x2aaaae927c00 9223372036854775807>>
    arg 0 <parm_decl 0x2aaaae92d2d0 b
        type <integer_type 0x2aaaae937840 long long int>
        used DI file pr17236.c line 2 col 30
        size <integer_cst 0x2aaaae927c60 64>
        unit size <integer_cst 0x2aaaae927c90 8>
        align 64 context <function_decl 0x2aaaae9e8a90 test>
        initial <integer_type 0x2aaaae937840 long long int>
        (reg/v:DI 60 [ b ])
        arg-type <integer_type 0x2aaaae937840 long long int>
        incoming-rtl (mem/c/i:DI (plus:SI (reg/f:SI 53 virtual-incoming-args)
                (const_int 8 [0x8])) [2 b+0 S8 A32])>
    arg 1 <parm_decl 0x2aaaae92d240 a
        type <integer_type 0x2aaaae937840 long long int>
        used DI file pr17236.c line 2 col 17
        size <integer_cst 0x2aaaae927c60 64>
        unit size <integer_cst 0x2aaaae927c90 8>
        align 64 context <function_decl 0x2aaaae9e8a90 test>
        initial <integer_type 0x2aaaae937840 long long int>
        (reg/v:DI 59 [ a ])
        arg-type <integer_type 0x2aaaae937840 long long int>
        incoming-rtl (mem/c/i:DI (reg/f:SI 53 virtual-incoming-args) [2 a+0 S8 A32])
        chain <parm_decl 0x2aaaae92d2d0 b>>>

and tmode SImode, which is still enough info to choose a better multiply.
The bug with 64*64->32 multiplication is now PR34522.
(In reply to comment #7)
> 	mull	%ebx
> 	leal	(%ecx,%edx), %esi	; what the heck, a simple addl could do!
> 	movl	%esi, %edx

Something disturbs the RA into emitting two DImode moves:

(insn:HI 10 36 37 2 m.c:2 (parallel [
            (set (reg:DI 0 ax)
                (mult:DI (zero_extend:DI (reg:SI 0 ax))
                    (zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
            (clobber (reg:CC 17 flags))
        ]) 304 {*umulsidi3_insn} (nil))

(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
        (reg:DI 0 ax)) 88 {*movdi_2} (nil))

(note:HI 11 37 12 2 NOTE_INSN_DELETED)

(insn:HI 12 11 18 2 m.c:2 (parallel [
            (set (reg:SI 4 si [+4 ])
                (plus:SI (reg:SI 2 cx [62])
                    (reg:SI 4 si [+4 ])))
            (clobber (reg:CC 17 flags))
        ]) 249 {*addsi_1} (nil))

(insn:HI 18 12 24 2 m.c:4 (set (reg/i:DI 0 ax [ <result> ])
        (reg:DI 3 bx [61])) 88 {*movdi_2} (nil))

(insn 24 18 33 2 m.c:4 (use (reg/i:DI 0 ax [ <result> ])) -1 (nil))

Note the two moves [(insn 36) and (insn 37)] around (insn 12).
(In reply to comment #15)
> Note two moves [(insn 36) and (insn 37)] around (insn 12).

Bah.  This is the correct sequence, around (insn 10), which seems to be the
root of all problems:

(insn:HI 9 8 36 2 m.c:2 (parallel [
            (set (reg:SI 2 cx [62])
                (plus:SI (reg:SI 2 cx [62])
                    (reg:SI 0 ax [63])))
            (clobber (reg:CC 17 flags))
        ]) 249 {*addsi_1} (nil))

(insn 36 9 10 2 m.c:2 (set (reg:SI 0 ax)
        (reg:SI 4 si [orig:66 b ] [66])) 47 {*movsi_1} (nil))

(insn:HI 10 36 37 2 m.c:2 (parallel [
            (set (reg:DI 0 ax)
                (mult:DI (zero_extend:DI (reg:SI 0 ax))
                    (zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
            (clobber (reg:CC 17 flags))
        ]) 304 {*umulsidi3_insn} (nil))

(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
        (reg:DI 0 ax)) 88 {*movdi_2} (nil))

The DImode AX found in (insn 10) could simply be propagated up and down the
RTL stream as the SImode destination of (insn 8) and the SImode source of
(insn 12).  For some reason, the RA is afraid to allocate registers in mixed
modes.
With this patch, GCC gets the preferences right, but it does not affect code
generation.

Index: regclass.c
===================================================================
--- regclass.c	(revision 130928)
+++ regclass.c	(working copy)
@@ -1651,9 +1651,15 @@ record_reg_classes (int n_alts, int n_op
 		      [(unsigned char) reg_pref[REGNO (op)].prefclass]
 		      [(int) classes[i]]);

-		  if (REGNO (ops[i]) != REGNO (ops[j])
-		      && ! find_reg_note (insn, REG_DEAD, op))
-		    alt_cost += 2;
+		  if (REGNO (ops[i]) != REGNO (ops[j]))
+		    {
+		      /* If the pseudo dies, tying it to the duplicate
+			 operand can be advantageous.  */
+		      if (find_reg_note (insn, REG_DEAD, op))
+			pp->cost[classes[j]]--;
+		      else
+			alt_cost += 2;
+		    }

 		  /* This is in place of ordinary cost computation for this
 		     operand, so skip to the end of the
Another baby step can be performed by:

Index: optabs.c
===================================================================
--- optabs.c	(revision 131053)
+++ optabs.c	(working copy)
@@ -1245,7 +1245,7 @@ expand_doubleword_mult (enum machine_mod
 	return NULL_RTX;
     }

-  adjust = expand_binop (word_mode, smul_optab, op0_high, op1_low,
+  adjust = expand_binop (word_mode, smul_optab, op1_low, op0_high,
 			 NULL_RTX, 0, OPTAB_DIRECT);
   if (!adjust)
     return NULL_RTX;
@@ -1274,7 +1274,7 @@ expand_doubleword_mult (enum machine_mod
 	return NULL_RTX;
     }

-  temp = expand_binop (word_mode, smul_optab, op1_high, op0_low,
+  temp = expand_binop (word_mode, smul_optab, op0_low, op1_high,
 		       NULL_RTX, 0, OPTAB_DIRECT);
   if (!temp)
     return NULL_RTX;

This patch generates (note the mem operands to imull):

test:
	subl	$8, %esp
	movl	%ebx, (%esp)
	movl	12(%esp), %ebx
	movl	%esi, 4(%esp)
	movl	20(%esp), %esi
	movl	%ebx, %ecx
	movl	%esi, %eax
	imull	24(%esp), %ecx
	imull	16(%esp), %eax
	addl	%eax, %ecx
	movl	%esi, %eax
	mull	%ebx
	movl	(%esp), %ebx
	leal	(%ecx,%edx), %esi
	movl	%esi, %edx
	movl	4(%esp), %esi
	addl	$8, %esp
	ret

but since mulsi3 is defined as:

(define_insn "*mulsi3_1"
  [(set (match_operand:SI 0 "register_operand" "=r,r,r")
	(mult:SI (match_operand:SI 1 "nonimmediate_operand" "%rm,rm,0")
		 (match_operand:SI 2 "general_operand" "K,i,mr")))
   (clobber (reg:CC FLAGS_REG))]

the question is, why didn't the RA perform the operand reversal by itself?
Currently it converts _both_ SI multiplies from:

(insn:HI 8 7 9 2 m.c:14 (parallel [
            (set (reg:SI 63)
                (mult:SI (mem/c/i:SI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [2 a+4 S4 A32])
                    (reg:SI 66 [ b ])))
            (clobber (reg:CC 17 flags))
        ]) 214 {*mulsi3_1} (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

into

--cut here--
(insn 35 7 8 2 m.c:14 (set (reg:SI 0 ax [63])
        (mem/c/i:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 16 [0x10])) [2 a+4 S4 A32])) 41 {*movsi_1} (nil))

(insn:HI 8 35 9 2 m.c:14 (parallel [
            (set (reg:SI 0 ax [63])
                (mult:SI (reg:SI 0 ax [63])
                    (reg:SI 4 si [orig:66 b ] [66])))
            (clobber (reg:CC 17 flags))
        ]) 214 {*mulsi3_1} (nil))
--cut here--
Created attachment 14792 [details]
patch to almost fix the bug

With this patch:

1) local-alloc first tries to allocate registers that go into small classes
2) regclass tries to use the same class for an output and an input that dies
3) i386's MODES_TIEABLE_P allows tying SImode with DImode (but not vice versa)

I now get this:

	pushl	%esi
	pushl	%ebx
	movl	12(%esp), %ecx
	movl	20(%esp), %eax
	movl	24(%esp), %ebx
	imull	%ecx, %ebx
	movl	16(%esp), %esi
	imull	%eax, %esi
	addl	%esi, %ebx
	mull	%ecx
	leal	(%ebx,%edx), %edx
	popl	%ebx
	popl	%esi

which has no moves.  It still spills two registers because it is not able to
use %edx (I'll look into it later).
There is a big catch-22.  When GCC ties one of regs 64/66 to reg 61, it
enlarges reg 61's live range to cover the tied register's live range.  When
it does this, it basically locks up %edx for the whole live range of
reg 61. :-(
Created attachment 14793 [details]
two hunks only from the previous patch

Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only one
spill!

	pushl	%ebx
	movl	8(%esp), %edx
	movl	16(%esp), %eax
	movl	20(%esp), %ecx
	imull	%edx, %ecx
	movl	12(%esp), %ebx
	imull	%eax, %ebx
	addl	%ebx, %ecx
	mull	%edx
	leal	(%ecx,%edx), %edx
	popl	%ebx

and no reg-reg move.  With Uros' patch it does use memory imull's, but also
adds reg-reg moves.  The next step should be to figure out why (with my patch
and without his) reload feels the need to put the memory operands from insns
7 and 8 into regs.
(In reply to comment #21)
> Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only
> one spill!
>
> 	pushl	%ebx
> 	movl	8(%esp), %edx
> 	movl	16(%esp), %eax
> 	movl	20(%esp), %ecx
> 	imull	%edx, %ecx
> 	movl	12(%esp), %ebx
> 	imull	%eax, %ebx
> 	addl	%ebx, %ecx
> 	mull	%edx
> 	leal	(%ecx,%edx), %edx
> 	popl	%ebx
>
> and no reg-reg move.

Excellent!

I think that we should fwprop stack arguments into the insns to further
optimize the code.  According to pentopt.pdf, there is no penalty at all (at
least for P4+) when one of the insn operands is a memory reference.  This
would relieve register pressure considerably.
Created attachment 14794 [details]
teach combine that reg = op(reg, mem) is better

Since combine operates on the whole pattern, it can be taught the trick that
swap_commutative_operands_with_target does in optabs.c.  This is the "right
way" to do the hack Uros proposed in comment #18.

Note that optabs.c does not get it because it creates reg=subreg*subreg.
Then CSE creates SImode memory loads from the subreg, and combine creates
reg=mem*reg.  With this patch, combine creates reg=reg*mem instead.
The patch in comment #23 might even be suitable for GCC 4.3 ...
Note that "fwprop" is not an exact term, because there *is* a memory load in
each multiplication, and propagating a second memory operand would create an
invalid insn.  You could try to add a split from reg=op(mem1, mem2) to
reg=mem1; reg=reg*mem2, but that won't help propagating into the mull anyway,
since it is not LOG_LINKS-related to the load.

The problem is that we end up having to reload *anyway*.  If we had a way to
express the relationship between operands 0 and 1 of the multiplication
*before reload*, the RTL optimizations could probably do much more.
I'm starting a SPEC run on the overall patch
I screwed up, so I have to rerun most of SPECfp2000, but the results seem to
be a wash.  Could anybody run the patch I'll attach soon on a wide range of
machines?
Created attachment 14800 [details] combined patch
The ira branch produces the same code as with my patch.
Current mainline (4.4.0 20080906) produces:

	pushl	%ebx
	movl	8(%ebp), %eax
	movl	16(%ebp), %edx
	movl	20(%ebp), %ecx
	movl	12(%ebp), %ebx
	imull	%eax, %ecx
	imull	%edx, %ebx
	mull	%edx
	addl	%ebx, %ecx
	popl	%ebx
	leal	(%ecx,%edx), %edx

So, fixed.