Bug 44382 - Slow integer multiply
Summary: Slow integer multiply
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.6.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 45671 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-06-02 15:04 UTC by H.J. Lu
Modified: 2011-10-21 14:41 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-06-02 15:15:06


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2010-06-02 15:04:07 UTC
For

---
extern int a, b, c, d, e, f;

void
foo ()
{
  a = f * b * c * d;
}
---

on Linux/x86-64, gcc generates:

	movl	b(%rip), %eax
	imull	f(%rip), %eax
	imull	c(%rip), %eax
	imull	d(%rip), %eax
	movl	%eax, a(%rip)
	ret

Icc generates:

        movl      c(%rip), %eax                                 #6.15
        movl      f(%rip), %edx                                 #6.7
        imull     d(%rip), %eax                                 #6.19
        imull     b(%rip), %edx                                 #6.11
        imull     %eax, %edx                                    #6.15
        movl      %edx, a(%rip)                                 #6.3
        ret                                                     #7.1

Icc version is faster since 2 multiplications may
be issued at the same time.
Comment 1 Richard Biener 2010-06-02 15:15:06 UTC
Because our tree reassoc doesn't re-associate them.
Comment 2 H.J. Lu 2010-06-04 13:08:34 UTC
(In reply to comment #1)
> Because our tree reassoc doesn't re-associate them.
> 

The tree reassoc pass makes it slower:

[hjl@gnu-6 44382]$ cat x.i
extern int a, b, c, d, e, f;
void
foo ()
{
  a = (b * c) * (d * e);
}
[hjl@gnu-6 44382]$ gcc -S -O2 x.i
[hjl@gnu-6 44382]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4,,15
.globl foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movl	c(%rip), %eax
	imull	b(%rip), %eax
	imull	d(%rip), %eax
	imull	e(%rip), %eax
	movl	%eax, a(%rip)
	ret
[hjl@gnu-6 44382]$ gcc -S -O2 x.i -fno-tree-reassoc
[hjl@gnu-6 44382]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4,,15
.globl foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movl	b(%rip), %eax
	movl	d(%rip), %edx
	imull	c(%rip), %eax
	imull	e(%rip), %edx
	imull	%edx, %eax
	movl	%eax, a(%rip)
	ret
[hjl@gnu-6 44382]$ 
Comment 3 Richard Biener 2010-06-04 13:21:24 UTC
Yes, reassoc linearizes instead of building a tree (saves one (or was it two?)
registers at best).
Comment 4 H.J. Lu 2010-06-04 13:56:24 UTC
(In reply to comment #3)
> Yes, reassoc linearizes instead of building a tree (saves one (or was it two?)
> registers at best).
> 

Should we always build a tree? It may increase register pressure.
Comment 5 H.J. Lu 2010-06-04 14:40:06 UTC
tree-ssa-reassoc.c has

    2. Left linearization of the expression trees, so that (A+B)+(C+D)
    becomes (((A+B)+C)+D), which is easier for us to rewrite later.
    During linearization, we place the operands of the binary
    expressions into a vector of operand_entry_t

I think this may always generate slower codes. We may not want to
use much more registers. We can limit us to 2 temporaries.
Comment 6 H.J. Lu 2010-09-15 04:29:21 UTC
*** Bug 45671 has been marked as a duplicate of this bug. ***
Comment 7 Bill Schmidt 2011-07-12 17:33:06 UTC
The test case from bug 45671 is as follows:

int myfunction (int a, int b, int c, int d, int e, int f, int g, int h) {
  int ret;

  ret = a + b + c + d + e + f + g + h;
  return ret;

}

Compiling with -O3 results in a series of dependent add instructions to
accumulate the sum.

        add 4,3,4
        add 4,4,5
        add 4,4,6
        add 4,4,7
        add 4,4,8
        add 4,4,9
        add 4,4,10


If we regrouped to (a+b)+(c+d)+... we can do multiple adds in parallel on
different execution units.
Comment 8 hjl@gcc.gnu.org 2011-09-06 16:42:56 UTC
Author: hjl
Date: Tue Sep  6 16:42:47 2011
New Revision: 178602

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=178602
Log:
PR middle-end/44382: Tree reassociation improvement

gcc/

2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

gcc/testsuite/

2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/tm.texi
    trunk/gcc/doc/tm.texi.in
    trunk/gcc/hooks.c
    trunk/gcc/hooks.h
    trunk/gcc/params.def
    trunk/gcc/target.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
    trunk/gcc/tree-ssa-reassoc.c
Comment 9 Bill Schmidt 2011-10-13 17:30:14 UTC
Just adding some status information well after the fact...

We experimented with adding powerpc64 hooks to use the parallel reassociation support from comment #8.  We elected not to enable this support because the results for SPEC were negative (quite negative in some cases), due to increased register pressure in loops where spill was already an issue.  Our plans at this point are to live with the left-linear association, at least until the spill costs can be mitigated in some fashion.

If parallel reassociation had some heuristics that predicted for register pressure (difficult in tree-ssa, I know), it might become practical for us.
Comment 10 Bill Schmidt 2011-10-21 14:41:13 UTC
One more data point.  I repeated the experiment using -fsched-pressure.  Although this reduced the degradations considerably, the overall results are equivocal.  I see a few improvements and a few degradations in the 1-4% range, with the geometric means essentially unchanged.  So even if -fsched-pressure were the default, there wouldn't be an overwhelming case for enabling this support on powerpc64.