19580 – [4.3 Regression] missed load/store motion

Bug 19580 - [4.3 Regression] missed load/store motion

Summary: [4.3 Regression] missed load/store motion

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	4.0.0

Importance:	P2 minor
Target Milestone:	4.4.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	alias, missed-optimization

Depends on:	19581 24257
Blocks:
	Show dependency tree / graph

Reported:	2005-01-23 01:34 UTC by Serge Belyshev
Modified:	2009-04-22 15:08 UTC (History)
CC List:	7 users (show)

See Also:
Host:
Target:
Build:
Known to work:	3.3.3 2.95.3 3.0.4 3.2.3 4.4.0
Known to fail:	3.4.6 4.0.4 4.1.2 4.3.0 4.3.3
Last reconfirmed:	2007-07-04 20:57:28

Attachments
lim patch (15.63 KB, application/octet-stream) 2007-09-13 04:45 UTC, revital eres	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Serge Belyshev 2005-01-23 01:34:19 UTC

------------------------------------------------------------------------------
int r[6];

void f (int n)
{
  while (-- n)
    {
      r [0] += r [5];
      r [1] += r [0];
      r [2] += r [1];
      r [3] += r [2];
      r [4] += r [3];
      r [5] += r [4];
    }
}
------------------------------------------------------------------------------

On i386 with -O2 -fomit-frame-pointer we get:

.L4:
	movl	20(%esp), %ebp
	movl	8(%esp), %eax
	movl	16(%esp), %ebx
	incl	24(%esp)
	addl	%edi, %ebp
	leal	(%ebp,%eax), %esi
	movl	12(%esp), %eax
	movl	%ebp, 20(%esp)
	leal	(%esi,%ebx), %ecx
	movl	4(%esp), %ebx
	movl	%esi, 8(%esp)
	leal	(%ecx,%eax), %edx
	movl	%ecx, 16(%esp)
	leal	(%edx,%ebx), %eax
	movl	24(%esp), %ebx
	cmpl	%ebx, 28(%esp)
	movl	%edx, 12(%esp)
	leal	(%eax,%edi), %edi
	movl	%eax, 4(%esp)
	movl	%edi, (%esp)
	jne	.L4

workaround: -fno-gcse

Comment 1 Andrew Pinski 2005-01-23 01:36:57 UTC

Confirmed, only a regression from 3.3.3 which gave the following good code:
.L5:
        addl    %edi, %ebp
        addl    %ebp, %esi
        addl    %esi, %ecx
        addl    %ecx, %edx
        addl    %edx, %eax
        addl    %eax, %edi
        decl    %ebx
        jne     .L5

Comment 2 Steven Bosscher 2005-01-23 01:49:29 UTC

This is a tree-optimization bug for 4.0 and later.  I think the problem 
is that we don't catch the store motion opportunity at the tree level 
because to the tree alias analyses arrays are opaque objects.  On RTL 
we can distinguish the array members. 
 
If this is the problem, then Dan's sa-branch work should fix it... 
Dan?

Comment 3 Andrew Pinski 2005-01-23 01:53:22 UTC

(In reply to comment #2)
> This is a tree-optimization bug for 4.0 and later.  I think the problem 
> is that we don't catch the store motion opportunity at the tree level 
> because to the tree alias analyses arrays are opaque objects.  On RTL 
> we can distinguish the array members. 
Then why don't we optimize in 3.4.0 while we did in 3.3.3.

> If this is the problem, then Dan's sa-branch work should fix it... 
> Dan? 

I already filed PR  19581 for that because on PPC even in 3.3.3 we don't get optimial code.  But 
someone should look into what happened to changed the behavor in 3.4.0.

I am willing to bet it was the rewritten of the load/store monition.

Comment 4 Daniel Berlin 2005-01-23 02:08:57 UTC

Subject: Re:  [3.4/4.0 Regression] poor register
 allocation



On Sat, 23 Jan 2005, steven at gcc dot gnu dot org wrote:

>
> ------- Additional Comments From steven at gcc dot gnu dot org  2005-01-23 01:49 -------
> This is a tree-optimization bug for 4.0 and later.  I think the problem
> is that we don't catch the store motion opportunity at the tree level
> because to the tree alias analyses arrays are opaque objects.  On RTL
> we can distinguish the array members.
>
> If this is the problem, then Dan's sa-branch work should fix it...
> Dan?

It will once i handle constant index array refs (It currently punts on 
array refs entirely).

Comment 5 Serge Belyshev 2005-01-23 04:34:58 UTC

Caused by this patch:

2003-04-01  Zdenek Dvorak  <rakdver@atrey.karlin.mff.cuni.cz>

	* gcse.c (struct ls_expr): Added pattern_regs field.
	(ldst_entry): Initialize it.
	(extract_mentioned_regs, extract_mentioned_regs_helper): New.
	(store_ops_ok): Use regs precomputed by them.
	(find_loads, store_killed_in_insn, load_kills_store): Change return
	type to bool.
	(store_killed_before, store_killed_after): Take position of register
	set in account.
	(reg_set_info): Store position of the setter.
	(gcse_main): Enable store motion.
	(mems_conflict_for_gcse_p): Enable load motion of non-symbol mems.
	(pre_insert_copy_insn, update_ld_motion_stores, insert_store): Prevent rtl
	sharing.
	(simple_mem): Enable store motion of non-symbol mems.
	(regvec): Type changed.
	(LAST_AVAIL_CHECK_FAILURE): New.
	(compute_store_table_current_insn): New.
	(build_store_vectors): Computation of availability and anticipatability
	moved ...
	(compute_store_table, find_moveable_store): ... here.
	(delete_store): Remove senseless comment.
	(store_motion): Reorganize.

Comment 6 Andrew Pinski 2005-01-23 04:38:21 UTC

(In reply to comment #5)
> Caused by this patch:
And I was right this was caused by the store motion rewrite.
http://gcc.gnu.org/ml/gcc-patches/2003-02/msg02090.html

Comment 7 Andrew Pinski 2005-01-23 04:39:09 UTC

Note this is not a register allocator problem any more but a missed optimization.

Comment 8 Steven Bosscher 2005-01-23 11:13:14 UTC

The patch you identified makes RTL store motion work.  Before the patch 
gcse-sm just did almost nothing at all. 
 
You can't blame a patch for fixing a pass. 
 
Closing this as won't fix.  Lets focus on PR19581 instead.

Comment 9 Serge Belyshev 2005-01-23 14:36:20 UTC

(In reply to comment #8)
> Closing this as won't fix.  Lets focus on PR19581 instead. 

Two notes:

1) tree-outof-ssa does not coalesce variables in this case:

g.c:                             g.c.t65.optimized:
------------------------------------------------------------------------
void g ()                       |
{                               |       <bb 0>:
  r [0] += r [2];               |         D.1129 = r[0] + r[2];
  r [1] += r [0];               |         r[0] = D.1129;
  r [2] += r [1];               |         D.1131 = D.1129 + r[1];
}                               |         r[1] = D.1131;
                                |         r[2] = D.1131 + r[2];
                                |         return;
                                |
------------------------------------------------------------------------

... while it does in this:
f.c:                             f.c.t65.optimized:
------------------------------------------------------------------------
void f ()                       |
{                               |       <bb 0>:
  r [0] += r [1];               |         r[0] = r[0] + r[1];
  r [1] += r [2];               |         r[1] = r[1] + r[2];
  r [2] += r [0];               |         r[2] = r[2] + r[0];
}                               |         return;
                                |
------------------------------------------------------------------------

2) and disabling -fgcse-lm *fixes* original problem for 3.4 and 4.0.

I cannot believe this bug is WONTFIX for 4.0 just because it will be fixed in
4.1 at tree level.

Comment 10 Andrew Pinski 2005-01-23 15:24:58 UTC

Well since this is more than just a 4.0.0 regressions lets reopen it.  Basically lsm was rewritten for 3.4.0 
and it causes this regression which means lsm is not good at all.

Comment 11 Steven Bosscher 2005-01-23 15:41:33 UTC

This is *not* a gcc 4.0 regression.  The 4.0 problem is PR19581. 
 
But Andrew prefers keeping multiple bugs open for the same regression, 
apparently.

Comment 12 Andrew Pinski 2005-01-23 15:43:12 UTC

(In reply to comment #11)
> This is *not* a gcc 4.0 regression.  The 4.0 problem is PR19581. 
What about 3.4.0?

Comment 13 Steven Bosscher 2005-01-23 15:51:46 UTC

 
g.c:                             g.c.t65.optimized: 
------------------------------------------------------------------------ 
void g ()                       | 
{                               |       <bb 0>: 
  r [0] += r [2];               |         D.1129 = r[0] + r[2]; 
  r [1] += r [0];               |         r[0] = D.1129; 
  r [2] += r [1];               |         D.1131 = D.1129 + r[1]; 
}                               |         r[1] = D.1131; 
                                |         r[2] = D.1131 + r[2]; 
                                |         return; 
 
What exactly are you expecting to be coalesced in this case, if I may ask?

Comment 14 Steven Bosscher 2005-01-23 15:56:13 UTC

Clearly it *is* a 3.4 regression. 
But the subject also includes 4.0.  And that's not right because we have 
disabled gcse store motion for gcc 4.0.  The bug for 4.0 is PR19581 and 
that is just a different issue.

Comment 15 Serge Belyshev 2005-01-23 18:31:43 UTC

(In reply to comment #13)
> What exactly are you expecting to be coalesced in this case, if I may ask? 

I am expecting all of D.1129 and D.1131 to be coalesced so this:

    D.1129 = r[0] + r[2]; 
    r[0] = D.1129; 
    D.1131 = D.1129 + r[1]; 
    r[1] = D.1131; 
    r[2] = D.1131 + r[2]; 

...is turned into what it was (like in second example):

    r[0] = r[0] + r[2];
    r[1] = r[0] + r[1];
    r[2] = r[1] + r[2];

Comment 16 Daniel Berlin 2005-01-23 19:07:31 UTC

Subject: Re:  [3.4/4.0 Regression] missed load/store
 motion



On Sun, 23 Jan 2005, belyshev at depni dot sinp dot msu dot ru wrote:

>
> ------- Additional Comments From belyshev at depni dot sinp dot msu dot ru  2005-01-23 18:31 -------
> (In reply to comment #13)
>> What exactly are you expecting to be coalesced in this case, if I may ask?
>
> I am expecting all of D.1129 and D.1131 to be coalesced so this:
>
>    D.1129 = r[0] + r[2];
>    r[0] = D.1129;
>    D.1131 = D.1129 + r[1];
>    r[1] = D.1131;
>    r[2] = D.1131 + r[2];
>
> ...is turned into what it was (like in second example):
>
>    r[0] = r[0] + r[2];
>    r[1] = r[0] + r[1];
>    r[2] = r[1] + r[2];

It can't.
It believes they conflict right now.
.

Comment 17 Steven Bosscher 2005-01-23 19:11:00 UTC

Ehm, does it really think they conflict now, or is it simply not replacing 
a reg with a load?

Comment 18 Steven Bosscher 2005-01-23 19:23:38 UTC

For x86 I get this: 
g: 
        movl    r+8, %edx 
        movl    r, %eax 
        addl    %edx, %eax 
        movl    %eax, r 
        addl    r+4, %eax 
        movl    %eax, r+4 
        addl    %edx, %eax 
        movl    %eax, r+8 
        ret 
 
That is pretty much the best you can get, as far as I can tell. 
 
For AMD64 it's similar: 
 
g: 
.LFB2: 
        movl    r+8(%rip), %edx 
        movl    r(%rip), %eax 
        addl    %edx, %eax 
        movl    %eax, r(%rip) 
        addl    r+4(%rip), %eax 
        movl    %eax, r+4(%rip) 
        addl    %edx, %eax 
        movl    %eax, r+8(%rip) 
        ret 
.LFE2: 
 
I'm not sure what you think the missed optimization is here.  You will have 
to show what you want at the assembly level, and explain why you think this 
is a coalescing problem.  So far, I don't see a missed optimization. 
 
What is worse is that we fail to do store motion when you put such blocks 
inside a loop, e.g. 
 
int r[6]; 
void g (int n) 
{ 
  while (--n) 
    { 
      r [0] += r [1]; 
      r [1] += r [2]; 
      r [2] += r [0]; 
    } 
} 
 
which is the issue discussed in PR19581.

Comment 19 Serge Belyshev 2005-01-23 19:51:03 UTC

(In reply to comment #18)
> I'm not sure what you think the missed optimization is here.  You will have 
> to show what you want at the assembly level, and explain why you think this 
> is a coalescing problem.  So far, I don't see a missed optimization. 

Short examples in comment #9 I used only to show that there is problem with
variable coalescing at tree level, of course they all optimized at rtl level.

To see *the* problem, compare i386 assembly and .optimized dumps for these two
functions on mainline with patch to bug 19464 applied:

int r[6];

void f (int n)
{
  while (-- n)
    {
      r [0] += r [5];
      r [1] += r [0];
      r [2] += r [1];
      r [3] += r [2];
      r [4] += r [3];
      r [5] += r [4];
    }
}

void g (int n)
{
  while (-- n)
    {
      r [0] += r [1];
      r [1] += r [2];
      r [2] += r [3];
      r [3] += r [4];
      r [4] += r [5];
      r [5] += r [0];
    }
}

Comment 20 Andrew Pinski 2005-07-22 21:13:22 UTC

Moving to 4.0.2 pre Mark.

Comment 21 Andrew Pinski 2005-10-16 22:14:04 UTC

We really should turn on gcse-sm for 4.1 but then again maybe it is too late for that.

Comment 22 Uroš Bizjak 2005-10-17 10:28:25 UTC

(In reply to comment #21)
> We really should turn on gcse-sm for 4.1 but then again maybe it is too late
> for that.

Better not. Look at PR 24257, comment #4 and comment #5, why.

Comment 23 Mark Mitchell 2005-10-31 02:34:16 UTC

This is not a very helpful audit trail.  However, I agree that the two samples in Comment #19 would ideally result in pretty similar code.  If we think the fix is to turn on -fgcse-sm by default, then this bug depends on 24257.

I'll leave this as P2, for now.

Comment 24 Steven Bosscher 2006-02-03 16:44:38 UTC

Realistically this is not fixable for GCC 4.1.

Comment 25 Richard Biener 2006-02-04 13:52:42 UTC

On the mainline now even g() regresses, probably because of the reassoc pass rewrite.  Of course on the mainline this is also "fixed" by --param salias-max-array-elements=6, which makes load/store motion work on the tree level.
It looks like expand only with

<L0>:;
  r[0] = r[0] + r[1];
  r[1] = r[1] + r[2];
  r[2] = r[2] + r[3];
  r[3] = r[3] + r[4];
  r[4] = r[4] + r[5];
  r[5] = r[5] + r[0];
  ivtmp.63 = ivtmp.63 + 1;
  if (ivtmp.63 != (unsigned int) n.68) goto <L0>; else goto <L2>;

produces RTL that we can optimize to

.L13:
        addl    %edi, %esi
        incl    %ebp
        addl    %ebx, %edi
        addl    %ecx, %ebx
        addl    %edx, %ecx
        addl    %eax, %edx
        addl    %esi, %eax
        cmpl    (%esp), %ebp
        jne     .L13

but not for (mainline)

<L0>:;
  D.1544 = r[1] + r[0];
  r[0] = D.1544;
  r[1] = r[2] + r[1];
  r[2] = r[3] + r[2];
  r[3] = r[4] + r[3];
  r[4] = r[5] + r[4];
  r[5] = D.1544 + r[5];
  n.61 = n.61 - 1;
  if (n.61 != 0) goto <L0>; else goto <L2>;

Comment 26 Daniel Berlin 2006-02-04 21:30:29 UTC

Buzz, thanks for playing.
The reassoc rewrite has nothing to do with this. It won't actually touch those operations because they are memory loads and stores.

If you look at the reassoc dumps, the most it will do here is
Transforming D.1551_26 + D.1542_27 into D.1542_27 + D.1551_26;

(IE just swap the operands so they are in sorted order)

This has no effect on anything, it used to be done automatically, and is now done manually.

Comment 27 Richard Biener 2006-02-14 16:48:12 UTC

The mainline has again returned to sane behavior for g() and -fgcse-sm does not make any difference for f().  And we now use lea on i686:

.L11:
        addl    %edi, -16(%ebp)
        leal    (%ebx,%edi), %edi
        leal    (%ecx,%ebx), %ebx
        leal    (%edx,%ecx), %ecx
        leal    (%eax,%edx), %edx
        addl    -16(%ebp), %eax
        subl    $1, %esi
        jne     .L11

Comment 28 Mark Mitchell 2006-02-24 00:25:36 UTC

This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.

Comment 29 Mark Mitchell 2006-05-25 02:32:28 UTC

Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.

Comment 30 Daniel Berlin 2007-04-17 06:53:02 UTC

Is this really still broken in mainline?
At least as of Richard's last update, it wasn't

Comment 31 Richard Biener 2007-04-17 09:22:12 UTC

It's broken as we want the code from comment #1 back.

Comment 32 Richard Biener 2007-09-12 14:44:06 UTC

load-store motion at the tree level should really catch this.  For this it
needs to be extended to disambiguate aliases by looking at the actual memory
references:

<bb 4>:
  # r_8 = PHI <r_29(5), r_23(D)(3)>
  # n_11 = PHI <n_3(5), n_14(3)>
  # VUSE <r_8>
  D.1137_4 = r[0];
  # VUSE <r_8>
  D.1138_5 = r[5];
  D.1139_6 = D.1138_5 + D.1137_4;
  # r_24 = VDEF <r_8>
  r[0] = D.1139_6;
  # VUSE <r_24>
  D.1140_7 = r[1];
  D.1141_9 = D.1139_6 + D.1140_7;
...
  # r_29 = VDEF <r_28>
  r[5] = D.1148_21;
  n_3 = n_11 + -1;
  if (n_3 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

Zdenek, I think you had a patch to make lim more precise in this regard?

Comment 33 Zdenek Dvorak 2007-09-12 14:49:47 UTC

> Zdenek, I think you had a patch to make lim more precise in this regard?

Yes.  Revital Eres was trying to put it into shape suitable for mainline a few months ago, I am not sure what is the status of that?

Comment 34 revital eres 2007-09-12 15:09:37 UTC

I did not engage with it for some time so I doubt it if my latest version of the patch (which is originally in http://gcc.gnu.org/ml/gcc-patches/2007-01/msg02331.html) is suitable for current mainline.  I will certainly try to re-apply and post it as soon as possible.

Comment 35 revital eres 2007-09-13 04:45:37 UTC

Created attachment 14200 [details]
lim patch

As I suspected – my latest available version is not suitable for current mainline (I attached it anyway though).

Comment 36 Steven Bosscher 2008-01-23 08:54:40 UTC

What is the relation between the LIM patch and the alias oracle patch that was floated some time ago?

Comment 37 Richard Biener 2008-01-23 09:12:06 UTC

The lim patch applies alias-oracle techniques to the loop invariant and store
motion parts of tree-ssa-loop-im.c.  I have a forward-port of Zdeneks patch to
current trunk.

Note that the alias-oracle patch for SCCVN does not fix this PR (it seems to
be mostly ineffective for PRE, possibly PHI translation needs adjustments
as well).

Comment 38 Richard Biener 2008-03-28 13:40:19 UTC

Fixed in GCC 4.4 with the store-motion rewrite to use an alias-oracle:

<bb 3>:
  r_I_lsm.18 = r[5];
  r_I_lsm.13 = r[0];
  r_I_lsm.14 = r[1];
  r_I_lsm.15 = r[2];
  r_I_lsm.16 = r[3];
  r_I_lsm.17 = r[4];
  r_I_lsm.27 = r_I_lsm.18;

<bb 4>:
  r_I_lsm.13 = r_I_lsm.13 + r_I_lsm.27;
  r_I_lsm.14 = r_I_lsm.14 + r_I_lsm.13;
  r_I_lsm.15 = r_I_lsm.14 + r_I_lsm.15;
  r_I_lsm.16 = r_I_lsm.15 + r_I_lsm.16;
  r_I_lsm.17 = r_I_lsm.16 + r_I_lsm.17;
  r_I_lsm.18 = r_I_lsm.17 + r_I_lsm.18;
  n.26 = n.26 + -1;
  r_I_lsm.28 = r_I_lsm.27;
  r_I_lsm.27 = r_I_lsm.18;
  if (n.26 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 5>:
  r_I_lsm.27 = r_I_lsm.28;
  r[0] = r_I_lsm.13;
  r[1] = r_I_lsm.14;
  r[2] = r_I_lsm.15;
  r[3] = r_I_lsm.16;
  r[4] = r_I_lsm.17;
  r[5] = r_I_lsm.18;

I'll add a testcase to the testsuite.

Comment 39 Richard Biener 2008-03-28 13:45:47 UTC

Subject: Bug 19580

Author: rguenth
Date: Fri Mar 28 13:44:41 2008
New Revision: 133683

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133683
Log:
2008-03-28  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/19580
	* gcc.dg/tree-ssa/loop-34.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 40 Joseph S. Myers 2008-07-04 16:48:54 UTC

Closing 4.1 branch.

Comment 41 Joseph S. Myers 2009-03-31 16:45:10 UTC

Closing 4.2 branch.

Comment 42 Richard Biener 2009-04-22 15:08:42 UTC

WONTFIX on the 4.3 branch.