Bug 17236 - inefficient code for long long multiply on x86
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.0.0
Importance: P2 enhancement
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization
Duplicates: 6585
Depends on: 19398
Blocks: 4.4pending
Reported: 2004-08-30 05:50 UTC by Dan Nicolaescu
Modified: 2008-09-06 16:19 UTC
CC: 6 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 2.95.3 3.0.4 3.3.3 3.4.1 4.0.0
Last reconfirmed: 2006-01-21 03:08:24


Attachments
patch to almost fix the bug (1.50 KB, patch)
2007-12-19 12:13 UTC, Paolo Bonzini
two hunks only from the previous patch (662 bytes, patch)
2007-12-19 12:43 UTC, Paolo Bonzini
teach combine that reg = op(reg, mem) is better (1.14 KB, patch)
2007-12-19 13:36 UTC, Paolo Bonzini
combined patch (1.87 KB, patch)
2007-12-20 14:15 UTC, Paolo Bonzini

Description Dan Nicolaescu 2004-08-30 05:50:39 UTC
Compiling:
unsigned long long LLM(unsigned long long arg1, unsigned long long arg2)
{
  return arg1 *  arg2;
}

with gcc -O2 -fomit-frame-pointer generates:

        pushl   %ebp
        pushl   %edi
        pushl   %ebx
        pushl   %edi
        pushl   %edi
        movl    32(%esp), %ecx
        movl    24(%esp), %eax
        movl    36(%esp), %ebx
        movl    24(%esp), %edi
        mull    %ecx
        imull   28(%esp), %ecx
        imull   %ebx, %edi
        movl    %edx, %ebp
        movl    %eax, (%esp)
        addl    %edi, %ebp
        movl    (%esp), %eax
        leal    (%ebp,%ecx), %ecx
        movl    %ecx, 4(%esp)
        movl    4(%esp), %edx
        popl    %ecx
        popl    %ebx
        popl    %ebx
        popl    %edi
        popl    %ebp
        ret

ICC -O2 generates much smaller code:

        movl      8(%esp), %ecx                                 #516.18
        imull     12(%esp), %ecx                                #516.18
        movl      16(%esp), %eax                                #516.18
        imull     4(%esp), %eax                                 #516.18
        addl      %eax, %ecx                                    #516.18
        movl      4(%esp), %eax                                 #516.18
        mull      12(%esp)                                      #516.18
        addl      %ecx, %edx                                    #516.18
        ret                                                     #516.18
Comment 1 Giovanni Bajo 2004-10-06 23:27:47 UTC
Confirmed. The code is now a little better:

LLM:
     pushl      %ebp
     pushl      %edi
     pushl      %ebx
     subl       $8, %esp
     movl       32(%esp), %ecx
     movl       36(%esp), %ebx
     movl       24(%esp), %eax
     mull       %ecx
     movl       %eax, (%esp)
     movl       24(%esp), %edi
     imull      %ebx, %edi
     movl       %edx, %ebp
     addl       %edi, %ebp
     imull      28(%esp), %ecx
     leal       (%ebp,%ecx), %ecx
     movl       %ecx, 4(%esp)
     movl       (%esp), %eax
     movl       4(%esp), %edx
     addl       $8, %esp
     popl       %ebx
     popl       %edi
     popl       %ebp
     ret
     .size      LLM, .-LLM
     .ident     "GCC: (GNU) 4.0.0 20041003 (experimental)"




On 2.95, we used to emit:

LLM:
     pushl %ebp
     pushl %edi
     pushl %esi
     pushl %ebx
     movl 20(%esp),%ecx
     movl 24(%esp),%ebx
     movl 28(%esp),%eax
     movl 32(%esp),%esi
     mull %ecx
     movl %eax,%edi
     movl 28(%esp),%eax
     imull %ecx,%esi
     imull %ebx,%eax
     movl %edx,%ebp
     addl %esi,%ebp
     addl %eax,%ebp
     movl %edi,%eax
     movl %ebp,%edx
     popl %ebx
     popl %esi
     popl %edi
     popl %ebp
     ret
.Lfe1:
     .size       LLM,.Lfe1-LLM
     .ident     "GCC: (GNU) 2.95.3 20010315 (release)"



Not a regression, but ICC still does much better.
Comment 2 CVS Commits 2005-03-14 18:24:22 UTC
Subject: Bug 17236

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	sayle@gcc.gnu.org	2005-03-14 18:24:15

Modified files:
	gcc            : ChangeLog optabs.c 

Log message:
	PR rtl-optimization/17236
	* optabs.c (expand_doubleword_mult): New helper function split out
	from expand_binop.  Permute the order in which instructions are
	emitted to minimize the number of simultaneously live registers.
	(expand_binop): Call expand_doubleword_mult to synthesize a double
	word multiplication.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7846&r2=2.7847
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&r1=1.260&r2=1.261

Comment 3 Giovanni Bajo 2005-03-15 10:04:50 UTC
Roger explains what else needs to be done here:
http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01386.html

Right now, after his patch, mainline generates this code:

        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %eax
        movl    20(%esp), %edx
        movl    24(%esp), %ecx
        movl    28(%esp), %ebx
        movl    %edx, %esi
        imull   %ecx, %esi
        movl    %ebx, %edi
        imull   %eax, %edi
        addl    %edi, %esi
        mull    %ecx
        leal    (%esi,%edx), %edx
        popl    %ebx
        popl    %esi
        popl    %edi
        ret
Comment 4 Giovanni Bajo 2005-03-15 10:07:40 UTC
Uros made some additional comments:
http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01427.html
Comment 5 roger 2007-02-02 00:17:07 UTC
It looks like Ian's recent subreg lowering pass patch has improved code generation on this testcase.  Previously, we'd spill three integer registers to the stack for "LLM"; we're now down to two.  [A significant improvement over the five we spilled when this bug was reported.]

Before:

LLM:    subl    $12, %esp
        movl    %ebx, (%esp)
        movl    28(%esp), %edx
        movl    20(%esp), %ebx
        movl    16(%esp), %ecx
        movl    24(%esp), %eax
        movl    %esi, 4(%esp)
        movl    %edx, %esi
        movl    %edi, 8(%esp)
        movl    %ebx, %edi
        movl    (%esp), %ebx
        imull   %ecx, %esi
        imull   %eax, %edi
        mull    %ecx
        addl    %edi, %esi
        movl    8(%esp), %edi
        leal    (%esi,%edx), %edx
        movl    4(%esp), %esi
        addl    $12, %esp
        ret

After:

LLM:    subl    $8, %esp
        movl    %ebx, (%esp)
        movl    20(%esp), %eax
        movl    %esi, 4(%esp)
        movl    24(%esp), %ecx
        movl    12(%esp), %esi
        movl    16(%esp), %ebx
        imull   %esi, %ecx
        imull   %eax, %ebx
        mull    %esi
        movl    4(%esp), %esi
        addl    %ebx, %ecx
        movl    (%esp), %ebx
        addl    $8, %esp
        leal    (%ecx,%edx), %edx
        ret
Comment 6 Steven Bosscher 2007-11-23 20:48:32 UTC
*** Bug 6585 has been marked as a duplicate of this bug. ***
Comment 7 Paolo Bonzini 2007-12-18 07:43:13 UTC
The generated code has changed a lot recently, though it still uses two spills:

        pushl   %esi
        pushl   %ebx
        movl    12(%esp), %ebx         ; load alow
        movl    20(%esp), %esi         ; load blow
        movl    24(%esp), %ecx         ; load bhigh
        imull   %ebx, %ecx             ; bhigh = bhigh * alow
        movl    16(%esp), %eax         ; load ahigh
        imull   %esi, %eax             ; ahigh = ahigh * blow
        addl    %eax, %ecx
        movl    %esi, %eax             ; need to move a low part in %eax
        mull    %ebx
        leal    (%ecx,%edx), %esi      ; what the heck, a simple addl could do!
        movl    %esi, %edx
        popl    %ebx
        popl    %esi

The important thing is that one add comes before the multiply and one after.  And the RTL before reload is very nice!

   28 r64:SI=[argp:SI]
   30 r66:SI=[argp:SI+0x8]
    7 {r62:SI=[argp:SI+0xc]*r64:SI;clobber flags:CC;}
    8 {r63:SI=[argp:SI+0x4]*r66:SI;clobber flags:CC;}
    9 {r62:SI=r62:SI+r63:SI;clobber flags:CC;}
   10 {r61:DI=zero_extend(r66:SI)*zero_extend(r64:SI);clobber flags:CC;}
   12 {r61:DI#4=r62:SI+r61:DI#4;clobber flags:CC;}
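
In C terms, the pre-reload RTL above is the classic double-word multiplication: the low word of the product comes from a widening 32x32->64 multiply of the low halves, and the two cross products only ever touch the high word.  A minimal C sketch (the function and variable names are mine, not from the dump):

#include <stdint.h>

uint64_t llm_lowered (uint64_t a, uint64_t b)
{
  uint32_t alow = (uint32_t) a, ahigh = (uint32_t) (a >> 32);
  uint32_t blow = (uint32_t) b, bhigh = (uint32_t) (b >> 32);

  /* insns 7-9: two imull plus one addl, high word only.  */
  uint32_t cross = bhigh * alow + ahigh * blow;
  /* insn 10: the widening mull of the low halves.  */
  uint64_t prod = (uint64_t) blow * (uint64_t) alow;
  /* insn 12: add the cross products into the high word.  */
  return prod + ((uint64_t) cross << 32);
}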

That could be basically:

   mov (blow), %eax
   mov (alow), %ecx
   imul (bhigh), %eax, %ebx  ;needs reload of (bh) in %ebx
   imul (ahigh), %ecx, %edx  ;needs reload of (ah) in %edx
   addl %edx, %ebx
   mul  %ecx
   addl %ebx, %edx

So, if we are content with a single spill, there is now mainly a register allocation problem.  For zero spills, we would need rematerialization of either alow or blow, as in:

   mov (blow), %eax
   imul (bhigh), %eax, %ecx  ;needs reload of (bh) in %ecx
   imul (ahigh), (alow), %edx  ;needs reload of (ah) in %edx
   addl %edx, %ecx
   mul  (al)
   addl %ecx, %edx

So, for now let's aim at one spill.  The main problem then is that gcc should try to keep one of alow or blow in %eax, in order to reuse it for the mull.  It could then do with a single spill, like this:

        pushl   %ebx
        movl    12(%esp), %eax         ; load alow
        movl    20(%esp), %ecx         ; load blow
        movl    24(%esp), %ebx         ; load bhigh, input reload
        imull   %eax, %ebx
        movl    16(%esp), %edx         ; load ahigh, input reload
        imull   %ecx, %edx
        addl    %edx, %ebx
        mul     %ecx
        addl    %ebx, %edx
        popl    %ebx
Comment 8 Paolo Bonzini 2007-12-18 07:53:28 UTC
For -mregparm=2 the code is this:

        pushl   %ebx
        movl    8(%esp), %ebx
        movl    12(%esp), %ecx
        imull   %ebx, %edx
        imull   %eax, %ecx
        addl    %edx, %ecx
        mull    %ebx
        popl    %ebx
        leal    (%ecx,%edx), %edx
        ret

It is better only by luck: the low part of "a" is already in %eax, so regclass decides on the "right" preference.

  Register 64 pref AREG, else GENERAL_REGS

Paolo
Comment 9 Paolo Bonzini 2007-12-18 08:05:35 UTC
Uros mentioned offlist that he wanted to hijack fwprop to always propagate stack slots into instructions.

It would be a relatively useful piece of infrastructure to have a flag in MEMs marking on-stack MEMs, because other MEMs definitely must not be propagated blindly.
Comment 10 Eric Botcazou 2007-12-18 08:10:08 UTC
> It would be a relatively useful piece of infrastructure to have a flag in MEMs
> that marks on-stack MEMs, because other MEMs definitely must not be propagated
> blindly.

Depending on your needs, MEM_NOTRAP_P could be a good approximation.
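
For illustration, a minimal sketch of the predicate fwprop could use under this approximation; the function name is hypothetical, while MEM_P and MEM_NOTRAP_P are real GCC macros:

/* Accept only MEMs known not to trap -- on x86 these are mostly
   frame/stack slots; any other MEM must not be propagated blindly.  */
static bool
propagatable_stack_mem_p (const_rtx x)
{
  return MEM_P (x) && MEM_NOTRAP_P (x);
}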
Comment 11 Uroš Bizjak 2007-12-18 13:47:21 UTC
Generated code for a similar example is just plain stupid:

--cut here--
int test(long long a, long long b)
{
	return a * b;
}
--cut here--

gcc -O3:

test:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    %esi, 4(%esp)
        movl    16(%ebp), %esi
        movl    %ebx, (%esp)
        movl    8(%ebp), %ebx
        movl    %esi, %eax
        movl    4(%esp), %esi
        mull    %ebx
        movl    (%esp), %ebx
        movl    %ebp, %esp
        popl    %ebp
        ret

gcc-4.0 is a little less creative and generates:

test:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    8(%ebp), %eax
        mull    16(%ebp)
        popl    %ebx
        popl    %esi
        popl    %edi
        leave
        ret
Comment 12 Paolo Bonzini 2007-12-18 16:01:37 UTC
The problem in comment #11 is that GCC generates a widening multiply, and cannot remove the DImode operations until after register allocation (!).  While the root cause is a deficiency in RTL-level DCE, I suggest filing a separate bug, because it could also be fixed at the tree level.
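
For reference, the lowering asked for here can be written directly in C: the high halves cannot affect a truncated 64x64->32 multiply, so a single 32-bit multiply of the low halves suffices.  A sketch (the function name is mine):

#include <stdint.h>

int test_lowered (long long a, long long b)
{
  /* Only the low 32 bits of the product survive the truncation, and
     they depend only on the low 32 bits of each operand.  */
  return (int) ((uint32_t) a * (uint32_t) b);
}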
Comment 13 Jakub Jelinek 2007-12-18 16:39:27 UTC
I think the tree level does the right thing: TER fixes this up, and expand_expr is called with
return (int) (b * a)
Later on expand_expr is called with
 <mult_expr 0x2aaaae9032c0
    type <integer_type 0x2aaaae937840 long long int DI
        size <integer_cst 0x2aaaae927c60 constant invariant 64>
        unit size <integer_cst 0x2aaaae927c90 constant invariant 8>
        align 64 symtab 0 alias set 2 canonical type 0x2aaaae937840 precision 64 min <integer_cst 0x2aaaae927bd0 -9223372036854775808> max <integer_cst 0x2aaaae927c00 9223372036854775807>>
   
    arg 0 <parm_decl 0x2aaaae92d2d0 b type <integer_type 0x2aaaae937840 long long int>
        used DI file pr17236.c line 2 col 30 size <integer_cst 0x2aaaae927c60 64> unit size <integer_cst 0x2aaaae927c90 8>
        align 64 context <function_decl 0x2aaaae9e8a90 test> initial <integer_type 0x2aaaae937840 long long int>
        (reg/v:DI 60 [ b ]) arg-type <integer_type 0x2aaaae937840 long long int>
        incoming-rtl (mem/c/i:DI (plus:SI (reg/f:SI 53 virtual-incoming-args)
        (const_int 8 [0x8])) [2 b+0 S8 A32])>
    arg 1 <parm_decl 0x2aaaae92d240 a type <integer_type 0x2aaaae937840 long long int>
        used DI file pr17236.c line 2 col 17 size <integer_cst 0x2aaaae927c60 64> unit size <integer_cst 0x2aaaae927c90 8>
        align 64 context <function_decl 0x2aaaae9e8a90 test> initial <integer_type 0x2aaaae937840 long long int>
        (reg/v:DI 59 [ a ]) arg-type <integer_type 0x2aaaae937840 long long int>
        incoming-rtl (mem/c/i:DI (reg/f:SI 53 virtual-incoming-args) [2 a+0 S8 A32]) chain <parm_decl 0x2aaaae92d2d0 b>>>
and tmode SImode, still enough info to choose a better multiply.
Comment 14 Paolo Bonzini 2007-12-18 16:50:52 UTC
The bug with 64*64->32 multiplication is now PR34522.
Comment 15 Uroš Bizjak 2007-12-18 18:20:16 UTC
(In reply to comment #7)

>         mull    %ebx
>         leal    (%ecx,%edx), %esi   ; what the heck, a simple addl could do!
>         movl    %esi, %edx

Something causes the RA to emit two DImode moves:

(insn:HI 10 36 37 2 m.c:2 (parallel [
            (set (reg:DI 0 ax)
                (mult:DI (zero_extend:DI (reg:SI 0 ax))
                    (zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
            (clobber (reg:CC 17 flags))
        ]) 304 {*umulsidi3_insn} (nil))

(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
        (reg:DI 0 ax)) 88 {*movdi_2} (nil))

(note:HI 11 37 12 2 NOTE_INSN_DELETED)

(insn:HI 12 11 18 2 m.c:2 (parallel [
            (set (reg:SI 4 si [+4 ])
                (plus:SI (reg:SI 2 cx [62])
                    (reg:SI 4 si [+4 ])))
            (clobber (reg:CC 17 flags))
        ]) 249 {*addsi_1} (nil))

(insn:HI 18 12 24 2 m.c:4 (set (reg/i:DI 0 ax [ <result> ])
        (reg:DI 3 bx [61])) 88 {*movdi_2} (nil))

(insn 24 18 33 2 m.c:4 (use (reg/i:DI 0 ax [ <result> ])) -1 (nil))

Note two moves [(insn 36) and (insn 37)] around (insn 12).
Comment 16 Uroš Bizjak 2007-12-18 18:33:21 UTC
(In reply to comment #15)

> Note two moves [(insn 36) and (insn 37)] around (insn 12).

Bah. This is the correct sequence, around (insn 10), which seems to be the root of all the problems:

(insn:HI 9 8 36 2 m.c:2 (parallel [
            (set (reg:SI 2 cx [62])
                (plus:SI (reg:SI 2 cx [62])
                    (reg:SI 0 ax [63])))
            (clobber (reg:CC 17 flags))
        ]) 249 {*addsi_1} (nil))

(insn 36 9 10 2 m.c:2 (set (reg:SI 0 ax)
        (reg:SI 4 si [orig:66 b ] [66])) 47 {*movsi_1} (nil))

(insn:HI 10 36 37 2 m.c:2 (parallel [
            (set (reg:DI 0 ax)
                (mult:DI (zero_extend:DI (reg:SI 0 ax))
                    (zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
            (clobber (reg:CC 17 flags))
        ]) 304 {*umulsidi3_insn} (nil))

(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
        (reg:DI 0 ax)) 88 {*movdi_2} (nil))


The DImode AX found in (insn 10) could simply be propagated up and down the RTL stream as the SImode destination of (insn 8) and the SImode source of (insn 12). For some reason, the RA is afraid to allocate registers in mixed modes.
Comment 17 Paolo Bonzini 2007-12-19 09:49:25 UTC
With this patch, GCC gets the preferences right, but it does not affect code generation.

Index: regclass.c
===================================================================
--- regclass.c  (revision 130928)
+++ regclass.c  (working copy)
@@ -1651,9 +1651,15 @@ record_reg_classes (int n_alts, int n_op
                          [(unsigned char) reg_pref[REGNO (op)].prefclass]
                          [(int) classes[i]]);
 
-                 if (REGNO (ops[i]) != REGNO (ops[j])
-                     && ! find_reg_note (insn, REG_DEAD, op))
-                   alt_cost += 2;
+                 if (REGNO (ops[i]) != REGNO (ops[j]))
+                   {
+                     /* If the pseudo dies, tying it to the duplicate
+                        operand can be advantageous.  */
+                     if (find_reg_note (insn, REG_DEAD, op))
+                       pp->cost[classes[j]]--;
+                     else
+                       alt_cost += 2;
+                   }
 
                  /* This is in place of ordinary cost computation
                     for this operand, so skip to the end of the
Comment 18 Uroš Bizjak 2007-12-19 12:11:10 UTC
Another baby step can be taken with:

Index: optabs.c
===================================================================
--- optabs.c    (revision 131053)
+++ optabs.c    (working copy)
@@ -1245,7 +1245,7 @@ expand_doubleword_mult (enum machine_mod
        return NULL_RTX;
     }
 
-  adjust = expand_binop (word_mode, smul_optab, op0_high, op1_low,
+  adjust = expand_binop (word_mode, smul_optab, op1_low, op0_high,
                         NULL_RTX, 0, OPTAB_DIRECT);
   if (!adjust)
     return NULL_RTX;
@@ -1274,7 +1274,7 @@ expand_doubleword_mult (enum machine_mod
        return NULL_RTX;
     }
 
-  temp = expand_binop (word_mode, smul_optab, op1_high, op0_low,
+  temp = expand_binop (word_mode, smul_optab, op0_low, op1_high,
                       NULL_RTX, 0, OPTAB_DIRECT);
   if (!temp)
     return NULL_RTX;

This patch generates (note the memory operands to imull):

test:
        subl    $8, %esp
        movl    %ebx, (%esp)
        movl    12(%esp), %ebx
        movl    %esi, 4(%esp)
        movl    20(%esp), %esi
        movl    %ebx, %ecx
        movl    %esi, %eax
        imull   24(%esp), %ecx
        imull   16(%esp), %eax
        addl    %eax, %ecx
        movl    %esi, %eax
        mull    %ebx
        movl    (%esp), %ebx
        leal    (%ecx,%edx), %esi
        movl    %esi, %edx
        movl    4(%esp), %esi
        addl    $8, %esp
        ret

but since the mulsi3 pattern is defined as:

(define_insn "*mulsi3_1"
  [(set (match_operand:SI 0 "register_operand" "=r,r,r")
	(mult:SI (match_operand:SI 1 "nonimmediate_operand" "%rm,rm,0")
		 (match_operand:SI 2 "general_operand" "K,i,mr")))
   (clobber (reg:CC FLAGS_REG))]

the question is why the RA didn't perform the operand reversal by itself. Currently it converts _both_ SImode multiplies from:

(insn:HI 8 7 9 2 m.c:14 (parallel [
            (set (reg:SI 63)
                (mult:SI (mem/c/i:SI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [2 a+4 S4 A32])
                    (reg:SI 66 [ b ])))
            (clobber (reg:CC 17 flags))
        ]) 214 {*mulsi3_1} (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

into

--cut here--
(insn 35 7 8 2 m.c:14 (set (reg:SI 0 ax [63])
        (mem/c/i:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 16 [0x10])) [2 a+4 S4 A32])) 41 {*movsi_1} (nil))

(insn:HI 8 35 9 2 m.c:14 (parallel [
            (set (reg:SI 0 ax [63])
                (mult:SI (reg:SI 0 ax [63])
                    (reg:SI 4 si [orig:66 b ] [66])))
            (clobber (reg:CC 17 flags))
        ]) 214 {*mulsi3_1} (nil))
--cut here--

Comment 19 Paolo Bonzini 2007-12-19 12:13:45 UTC
Created attachment 14792
patch to almost fix the bug

With this patch:
1) local-alloc first tries to allocate registers that go into small classes
2) regclass tries to use the same class for an output and an input that dies
3) i386 MODES_TIEABLE_P allows tying SImode with DImode (but not vice versa)

I now get this:
        pushl   %esi
        pushl   %ebx
        movl    12(%esp), %ecx
        movl    20(%esp), %eax
        movl    24(%esp), %ebx
        imull   %ecx, %ebx
        movl    16(%esp), %esi
        imull   %eax, %esi
        addl    %esi, %ebx
        mull    %ecx
        leal    (%ebx,%edx), %edx
        popl    %ebx
        popl    %esi

which has no moves.  It still spills two registers because it is not able to use %edx (I'll look into it later).
Comment 20 Paolo Bonzini 2007-12-19 12:32:19 UTC
There is a big catch-22.  When GCC ties one of regs 64/66 to reg 61, it enlarges reg 61's live range to cover that of the tied register.  When it does this, it basically locks up %edx for the whole live range of reg 61.  :-(
Comment 21 Paolo Bonzini 2007-12-19 12:43:37 UTC
Created attachment 14793
two hunks only from the previous patch

Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only one spill!

        pushl   %ebx
        movl    8(%esp), %edx
        movl    16(%esp), %eax
        movl    20(%esp), %ecx
        imull   %edx, %ecx
        movl    12(%esp), %ebx
        imull   %eax, %ebx
        addl    %ebx, %ecx
        mull    %edx
        leal    (%ecx,%edx), %edx
        popl    %ebx

and no reg-reg move.

With Uros' patch it does use memory imulls, but it also adds reg-reg moves.  The next step should be to figure out why (with my patch and without his) reload feels the need to put the memory operands from insns 7 and 8 into registers.
Comment 22 Uroš Bizjak 2007-12-19 13:11:50 UTC
(In reply to comment #21)

> Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only one
> spill!
> 
>         pushl   %ebx
>         movl    8(%esp), %edx
>         movl    16(%esp), %eax
>         movl    20(%esp), %ecx
>         imull   %edx, %ecx
>         movl    12(%esp), %ebx
>         imull   %eax, %ebx
>         addl    %ebx, %ecx
>         mull    %edx
>         leal    (%ecx,%edx), %edx
>         popl    %ebx
> 
> and no reg-reg move.

Excellent!

I think that we should fwprop stack arguments into the insns to further optimize the code. According to pentopt.pdf, there is no penalty at all (at least for P4+) when one of the insn operands is a memory reference. This would relieve register pressure considerably.
Comment 23 Paolo Bonzini 2007-12-19 13:36:01 UTC
Created attachment 14794
teach combine that reg = op(reg, mem) is better

Since combine operates on the whole pattern, it can be taught the trick that swap_commutative_operands_with_target does in optabs.c.  This is the "right way" to do the hack Uros proposed in comment #18.

Note that optabs.c does not get it because it creates reg=subreg*subreg.  Then CSE creates SImode memory loads from the subreg, and combine creates reg=mem*reg.  With this patch, combine creates reg=reg*mem instead.
Comment 24 Steven Bosscher 2007-12-19 13:48:44 UTC
The patch in comment #23 might even be suitable for GCC 4.3 ...
Comment 25 Paolo Bonzini 2007-12-19 13:50:25 UTC
Note that "fwprop" is not an exact term, because there *is* a memory load in each multiplication, and propagating a second memory operand will create an invalid insn.  You may try to add a split from reg=op(mem1, mem2) to reg=mem1;reg=reg*mem2, but anyway, it won't help propagating into the mull, since it is not LOG_LINKS-related to the load.

The problem is that we end up having to reload *anyway*.  If we had a way to express the relationship between operands 0 and 1 of the multiplication *before reload*, the RTL optimizations could probably do much more.
Comment 26 Paolo Bonzini 2007-12-19 13:53:48 UTC
I'm starting a SPEC run on the overall patch.
Comment 27 Paolo Bonzini 2007-12-20 13:53:04 UTC
I screwed up, so I have to rerun most of SPECfp2000, but the results seem to be a wash.  Can anybody run the patch I'll attach soon on a wide range of machines?
Comment 28 Paolo Bonzini 2007-12-20 14:15:47 UTC
Created attachment 14800
combined patch
Comment 29 Paolo Bonzini 2008-03-07 08:26:54 UTC
The ira branch produces the same code as with my patch.
Comment 30 Uroš Bizjak 2008-09-06 16:11:05 UTC
Current mainline (4.4.0 20080906) produces:

	pushl	%ebx
	movl	8(%ebp), %eax
	movl	16(%ebp), %edx
	movl	20(%ebp), %ecx
	movl	12(%ebp), %ebx
	imull	%eax, %ecx
	imull	%edx, %ebx
	mull	%edx
	addl	%ebx, %ecx
	popl	%ebx
	leal	(%ecx,%edx), %edx

So, fixed.
Comment 31 Uroš Bizjak 2008-09-06 16:18:32 UTC
*** Bug 6585 has been marked as a duplicate of this bug. ***