Bug 18463 - [4.0 Regression] suboptimal use of fancy x86 addressing modes
[4.0 Regression] suboptimal use of fancy x86 addressing modes
Status: RESOLVED FIXED
Product: gcc
Classification: Unclassified
Component: tree-optimization
4.0.0
: P2 normal
: 4.1.0
Assigned To: Zdenek Dvorak
: missed-optimization
: 17647 (view as bug list)
Depends on:
Blocks: 17647
  Show dependency treegraph
 
Reported: 2004-11-13 17:09 UTC by Steven Bosscher
Modified: 2007-01-18 03:09 UTC (History)
5 users (show)

See Also:
Host:
Target: i?86-*-*
Build:
Known to work: 4.1.0 4.1.1 4.1.2 4.2.0
Known to fail: 4.0.3
Last reconfirmed: 2005-12-27 00:26:17


Attachments
Patch to cse to canon the rtx for addresses (1.69 KB, patch)
2004-11-13 20:18 UTC, Andrew Pinski
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Bosscher 2004-11-13 17:09:40 UTC
The item "Moving floating point through integer registers" 
on http://gcc.gnu.org/projects/optimize.html shows how GCC 
can move a float array element via an integer register. 
 
Consider the following test case: 
 
void 
fcpy(float *restrict a,  float *restrict b, 
     float *restrict aa, float *restrict bb, int n) 
{ 
        int i; 
        for(i = 0; i < n; i++) { 
                aa[i]=a[i]; 
                bb[i]=b[i]; 
        } 
} 
 
GCC 3.3 produces the following code for the inner loop for 
-O2 -fomit-frame-pointer -fschedule-insns --std=c99: 
 
.L6: 
        movl    (%ebp,%ecx,4), %eax 
        movl    (%edi,%ecx,4), %edx 
        movl    %eax, (%esi,%ecx,4) 
        movl    %edx, (%ebx,%ecx,4) 
        incl    %ecx 
        cmpl    36(%esp), %ecx 
        jl      .L6 
 
GCC 4.0 (4.0.0 20041112) produces this piece of junk: 
 
.L4: 
        movl    20(%esp), %eax 
        leal    0(,%ebx,4), %edx 
        cmpl    %ebx, %ebp 
        movl    -4(%eax,%edx), %ecx 
        leal    1(%ebx), %eax 
        movl    %eax, %ebx 
        movl    %ecx, -4(%edi,%edx) 
        movl    24(%esp), %ecx 
        movl    -4(%edx,%ecx), %eax 
        movl    %eax, -4(%edx,%esi) 
        jg      .L4 
 
This is a bad regression from at least gcc 2.95 and gcc 3.1, 
as demonstraded on the projects/optimize.html page.
Comment 1 Andrew Pinski 2004-11-13 17:22:54 UTC
Confirmed, the problem is that DOM does:
  D.1192 = (unsigned int) i;
  D.1194 = (float * restrict) D.1192 * 4B;
  *(aa2 + D.1194) = *(a2 + D.1194);
  *(bb2 + D.1194) = *(b2 + D.1194);

Note how we use D.1194 in all three places. for PPC this is the correct thing to do but not for x86 which 
has three operands loads.
Comment 2 Andrew Pinski 2004-11-13 17:42:05 UTC
Though I should note that PPC is much better on the mainline than before:
gcc 4.0.0:
L4:
        lfsx f0,r3,r2
        stfsx f0,r5,r2
        lfsx f13,r4,r2
        stfsx f13,r6,r2
        addi r2,r2,4
        bdnz L4

gcc 3.3 (Apple's):
L9:
        slwi r7,r11,2
        addi r11,r11,1
        lfsx f0,r7,r3
        stfsx f0,r7,r5
        lfsx f1,r7,r4
        stfsx f1,r7,r6
        bdnz L9

So really this is a target specific bug :).

Also here the loop for x86_64:
.L4:
        movl    (%rdx,%r10), %eax
        incl    %ecx
        movl    %eax, (%rdx,%r9)
        movl    (%rdx,%rdi), %eax
        movl    %eax, (%rdx,%rsi)
        addq    $4, %rdx
        cmpl    %ecx, %r8d
        jg      .L4

Note changing the type of n and i to be unsigned we get slightly better code:
.L4:
        movl    -16(%ebp), %ebx
        leal    0(,%ecx,4), %eax
        incl    %ecx
        cmpl    %ecx, 24(%ebp)
        movl    (%ebx,%eax), %edx
        movl    -20(%ebp), %ebx
        movl    %edx, (%edi,%eax)
        movl    (%esi,%eax), %edx
        movl    %edx, (%ebx,%eax)
        jne     .L4

So IV-OPTs is not doing its job correctly in one place.
Comment 3 Steven Bosscher 2004-11-13 17:52:19 UTC
At least x86 and ARM have {reg + reg OP const} addressing 
modes.  Unfortunately we rip such expressions apart already 
in the gimplifier.  This is something we canot fix properly 
on trees.  TER could perhaps do it, but that pass should 
really go away itself, and we don't know anything about 
addressing modes on trees anyway.   Looks like we need to 
teach an RTL loop optimizer about this... 
 
Comment 4 Andrew Pinski 2004-11-13 17:54:23 UTC
For PPC at least IV-OPTS should note that we have post increment and decrement the pointers before 
the loop and then increment all of them inside the loop, aka:
void
fcpy(float *restrict a,  float *restrict b,
     float *restrict aa, float *restrict bb, unsigned n)
{
        unsigned i;
        aa-=1; a-=1; bb-=1; b-=1;
        for(i = 0; i < n; i++) {
        aa+=1; a+=1; bb+=1; b+=1;
                *bb=*b;
                *aa=*a;
        }
}
So we get:
L4:
        lfsu f0,4(r4)
        lfsu f13,4(r3)
        stfsu f0,4(r6)
        stfsu f13,4(r5)
        bdnz L4
which is the most optimal for PPC
Comment 5 Andrew Pinski 2004-11-13 18:14:04 UTC
Here is the reduced testcase for the problem, it has nothing to do with loops at all:
void
fcpy(float *restrict a,  float *restrict b,
     float *restrict aa, float *restrict bb, unsigned n)
{
                aa[n]=a[n];
                bb[n]=b[n];
}

DOM is doing CSE of n*4 which is the right thing to do.
Comment 6 Andrew Pinski 2004-11-13 18:46:18 UTC
This is a RTL problem as it works correctly on ARM which has it ...

I should note that arm's instruction has nothing special in its .md file:
(define_insn "*arm_movsf_soft_insn"
  [(set (match_operand:SF 0 "nonimmediate_operand" "=r,r,m")
        (match_operand:SF 1 "general_operand"  "r,mE,r"))]
  "TARGET_ARM
Comment 7 Steven Bosscher 2004-11-13 19:34:53 UTC
CSE is trying to reconstruct the addressing mode, but it has 
 
(plus:SI (ashift:SI (reg/v:SI 61 [ n ]) (const_int 2 [0x2])) 
         (reg/v/f:SI 59 [ a ])) 
 
According to hp, the canonical form would be with a mult. 
 
Breakpoint 9, ix86_decompose_address (addr=0x401d515c, out=0xbfffec48) 
    at ../../mainline/gcc/config/i386/i386.c:4567 
4567    { 
1: debug_rtx (addr) = (plus:SI (ashift:SI (reg/v:SI 61 [ n ]) 
        (const_int 2 [0x2])) 
    (reg/v/f:SI 59 [ a ])) 
void 
(gdb) bt 
#0  ix86_decompose_address (addr=0x401d5078, out=0xbfffec78) 
    at ../../mainline/gcc/config/i386/i386.c:4567 
#1  0x082ee548 in legitimate_address_p (mode=3221220472, addr=0x401d5078, 
strict=0) 
    at ../../mainline/gcc/config/i386/i386.c:5064 
#2  0x0828ec5e in memory_address_p (mode=3221220472, addr=0xbfffec78) 
    at ../../mainline/gcc/recog.c:1268 
#3  0x0828ed5f in general_operand (op=0x401d5084, mode=SFmode) at ../../
mainline/gcc/recog.c:979 
#4  0x0828f029 in nonimmediate_operand (op=0x401d5084, mode=3221220472) 
    at ../../mainline/gcc/recog.c:1143 
#5  0x08226803 in recog_16 (x0=0x401d50b4, insn=0x4016c230, pnum_clobbers=0x0) 
    at insn-recog.c:16332 
#6  0x0828e07c in recog_memoized_1 (insn=0x4016c230) at ../../mainline/gcc/
recog.c:123 
#7  0x08121594 in cse_insn (insn=0x4016c230, libcall_insn=0x0) at ../../
mainline/gcc/cse.c:4783 
Comment 8 Andrew Pinski 2004-11-13 20:18:12 UTC
Created attachment 7541 [details]
Patch to cse to canon the rtx for addresses

I have not bootstrapped this at all but I thought I would attach it here.
Comment 9 CVS Commits 2004-11-25 23:10:34 UTC
Subject: Bug 18463

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	pinskia@gcc.gnu.org	2004-11-25 23:10:27

Modified files:
	gcc            : ChangeLog cse.c 

Log message:
	2004-11-25  Andrew Pinski <pinskia@physics.uc.edu>
	
	parts of PR rtl-opt/18463, rtl-opt/17647
	* cse.c (canon_for_address): New function.
	(find_best_addr): Call canon_for_address before getting the
	address's cost when checking if we should take that address.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.6555&r2=2.6556
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/cse.c.diff?cvsroot=gcc&r1=1.325&r2=1.326

Comment 10 Andrew Pinski 2004-11-25 23:29:49 UTC
This is mostly a iv-opts problem.
But note we still don't get the most optimal code with -fno-ivopts:
.L4:
        movl    8(%ebp), %ebx
        movl    (%ebx,%edx,4), %eax
        movl    20(%ebp), %ebx
        movl    %eax, (%esi,%edx,4)
        movl    (%edi,%edx,4), %eax
        movl    %eax, (%ebx,%edx,4)
        incl    %edx
        cmpl    %edx, %ecx
        jg      .L4

But that is because of we are pulling in the load from the agruments into the loop (that is a different 
bug but I think I should mark that as a regression).

We still get the same asm as given in comment #0 with -fivopts still on.
Comment 11 Andrew Pinski 2004-11-25 23:55:04 UTC
PR 18137 is the one which is about reload fucking up and pull the load of the arguments into the loop.
Comment 12 Andrew Pinski 2004-11-26 05:11:25 UTC
Actually I missed that you have to use -fomit-frame-pointer, so this is not related to PR 18137 after all.
Comment 13 Zdenek Dvorak 2004-11-26 08:12:37 UTC
The problem indeed is ivopts - dom interaction. Ivopts decide that since
reg + 4 * reg is a cheap addressing mode, there is no reason to do anything
else than what it does.  To cure this we need to be able to allow ivopts
to express more clearly that it does not want an expression to be played with;
I think the best solution is to have a tree code that would map directly to the
memory access (including the addressing mode).  I am working on the patch.
Comment 14 Zdenek Dvorak 2004-11-28 22:56:08 UTC
I have the (experimental) patch for addressing mode selection on trees
(http://atrey.karlin.mff.cuni.cz/~rakdver/diff_lower_address.diff).
It indeed helps; we get

  i = 0;

<L0>:;
  mem[aa + 4B * i]{*D.1047} = mem[a + 4B * i]{*D.1048};
  mem[bb + 4B * i]{*D.1050} = mem[b + 4B * i]{*D.1051};
  i = (int) ((unsigned int) i + 1);
  if (n > i) goto <L0>; else goto <L2>;

<L2>:;

in .vars dump and

.L4:
        movl    (%ebp,%edx,4), %eax
        movl    %eax, (%esi,%edx,4) 
        movl    (%edi,%edx,4), %eax 
        movl    %eax, (%ebx,%edx,4)
        incl    %edx
        cmpl    %edx, %ecx
        jg      .L4

in the assembler, which seems fine (except that the memory references are not
reordered; maybe some of the aliasing information gets lost due to the patch
currently).

The patch definitely won't make it for 4.0, but I would like to get it or
something similar to 4.1.
Comment 15 Giovanni Bajo 2004-11-28 23:38:55 UTC
While the patch looks great to me, it is not feasable as you said for 4.0. 
Since this is a 4.0 regression, we should probably look for a way to fix this 
problem in a less intrusive (even if not totally correct) way on 4.0. Do you 
have any idea?
Comment 16 Andrew Pinski 2004-12-05 04:29:24 UTC
*** Bug 17647 has been marked as a duplicate of this bug. ***
Comment 17 Steven Bosscher 2005-01-29 02:34:38 UTC
*** Bug 19680 has been marked as a duplicate of this bug. ***
Comment 18 Steven Bosscher 2005-01-29 02:37:18 UTC
This problem is bigger than just the int<-->fp problem of the original 
bug report.  It's basically the same problem, namely a poor choice of 
addressing modes. 
Comment 19 tbp 2005-01-29 03:15:58 UTC
Some recent discussion about related symptoms.

http://gcc.gnu.org/ml/gcc/2005-01/msg01667.html
Comment 20 Zdenek Dvorak 2005-04-21 07:57:50 UTC
The up-to-date version of the patch that fixes the problem:

http://gcc.gnu.org/ml/gcc-patches/2005-04/msg01360.html
Comment 21 Andrew Pinski 2005-06-13 12:46:07 UTC
On the mainline we get now:
.L4:
        movl    (%ebp,%edx,4), %eax
        movl    %eax, (%ebx,%edx,4)
        movl    (%esi,%edx,4), %eax
        movl    %eax, (%ecx,%edx,4)
        incl    %edx
        cmpl    %edx, %edi
        jne     .L4

Which is better but the two loads/stores are not scheduled.
Comment 22 Jan Hubicka 2005-06-13 13:40:43 UTC
Subject: Re:  [4.0/4.1 Regression] suboptimal use of fancy x86 addressing modes

> 
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2005-06-13 12:46 -------
> On the mainline we get now:
> .L4:
>         movl    (%ebp,%edx,4), %eax
>         movl    %eax, (%ebx,%edx,4)
>         movl    (%esi,%edx,4), %eax
>         movl    %eax, (%ecx,%edx,4)
>         incl    %edx
>         cmpl    %edx, %edi
>         jne     .L4
> 
> Which is better but the two loads/stores are not scheduled.

-frename-registers?  But hardware will do that for you anyway.

Honza
> 
> -- 
> 
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18463
> 
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.
Comment 23 Andrew Pinski 2005-09-13 18:51:50 UTC
This is what we get one the mainline:
.L4:
        movl    (%ecx), %eax
        addl    $4, %ecx
        movl    %eax, (%edi,%edx,4)
        movl    (%ebp,%edx,4), %eax
        movl    %eax, (%esi,%edx,4)
        incl    %edx
        cmpl    %edx, %ebx
        jne     .L4

Note the code in comment #4  has a target patch which improves this a little further:
Index: i386.c
===============================================================
====
RCS file: /cvs/gcc/gcc/gcc/config/i386/i386.c,v
retrieving revision 1.858
diff -u -p -r1.858 i386.c
--- i386.c      6 Sep 2005 19:57:46 -0000       1.858
+++ i386.c      13 Sep 2005 18:49:44 -0000
@@ -5273,6 +5273,10 @@ ix86_address_cost (rtx x)
   /* More complex memory references are better.  */
   if (parts.disp && parts.disp != const0_rtx)
     cost--;
+
+  if (parts.scale != 1)
+    cost--;
+
   if (parts.seg != SEG_DEFAULT)
     cost--;
 

But since I don't have SPEC, I have not submitted the patch.  Steven could you test this patch and 
submit it for me?

ChangeLog (please make a better changelog): 
(ix86_address_cost): More complex is cheaper than anything else.
Comment 24 Steven Bosscher 2005-09-14 08:21:28 UTC
Comment on attachment 7541 [details]
Patch to cse to canon the rtx for addresses

Already in mainline
Comment 25 Zdenek Dvorak 2005-09-14 11:04:17 UTC
Auch; that patch is actually a very bad idea.  Pretending that complex
addressing modes are cheaper, when they are not, is just confusing.  If there
are optimizers that indeed want to prefer complex addressing modes (like cse and
combine, I suppose), they should reflect that in the cost function they are
using, rather than forcing this missconception to every other optimizer.
Comment 26 Richard Biener 2005-09-14 12:22:27 UTC
It looks like it is just following existing practice?
Comment 27 Zdenek Dvorak 2005-09-14 12:41:08 UTC
Subject: Re:  [4.0 Regression] suboptimal use of fancy x86 addressing modes

> It looks like it is just following existing practice?

yes, I know.  The practice is just wrong, though.
Comment 28 Zdenek Dvorak 2005-09-19 15:19:24 UTC
Another patch that improves the code on i686:  
  
http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01159.html 
 
(with -O2 -fomit-frame-pointer:) 
 
.L4: 
        movl    (%ebp,%edx,4), %eax 
        movl    %eax, (%esi,%edx,4) 
        movl    (%edi,%edx,4), %eax 
        movl    %eax, (%ebx,%edx,4) 
        incl    %edx 
        cmpl    %edx, %ecx 
        jne     .L4 
 
Comment 29 CVS Commits 2005-09-20 07:09:38 UTC
Subject: Bug 18463

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rakdver@gcc.gnu.org	2005-09-20 07:09:22

Modified files:
	gcc            : ChangeLog tree-chrec.c tree-chrec.h 
	                 tree-scalar-evolution.c 

Log message:
	PR tree-optimization/18463
	* tree-chrec.c (chrec_convert): Return fold_converted chrec if
	converting it directly is not possible.
	(chrec_convert_aggressive): New function.
	* tree-chrec.h (chrec_convert_aggressive): Declare.
	* tree-scalar-evolution.c (instantiate_parameters_1, resolve_mixers):
	Fold chrec conversions aggressively if asked to.
	(instantiate_parameters): Modified because of changes in
	instantiate_parameters_1.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.9985&r2=2.9986
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-chrec.c.diff?cvsroot=gcc&r1=2.27&r2=2.28
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-chrec.h.diff?cvsroot=gcc&r1=2.10&r2=2.11
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-scalar-evolution.c.diff?cvsroot=gcc&r1=2.37&r2=2.38

Comment 30 Gabriel Dos Reis 2007-01-18 03:09:47 UTC
Fixed in GCC-4.1.0.
Not to be fixed in GCC-4.0.x