The item "Moving floating point through integer registers" on http://gcc.gnu.org/projects/optimize.html shows how GCC can move a float array element via an integer register. Consider the following test case: void fcpy(float *restrict a, float *restrict b, float *restrict aa, float *restrict bb, int n) { int i; for(i = 0; i < n; i++) { aa[i]=a[i]; bb[i]=b[i]; } } GCC 3.3 produces the following code for the inner loop for -O2 -fomit-frame-pointer -fschedule-insns --std=c99: .L6: movl (%ebp,%ecx,4), %eax movl (%edi,%ecx,4), %edx movl %eax, (%esi,%ecx,4) movl %edx, (%ebx,%ecx,4) incl %ecx cmpl 36(%esp), %ecx jl .L6 GCC 4.0 (4.0.0 20041112) produces this piece of junk: .L4: movl 20(%esp), %eax leal 0(,%ebx,4), %edx cmpl %ebx, %ebp movl -4(%eax,%edx), %ecx leal 1(%ebx), %eax movl %eax, %ebx movl %ecx, -4(%edi,%edx) movl 24(%esp), %ecx movl -4(%edx,%ecx), %eax movl %eax, -4(%edx,%esi) jg .L4 This is a bad regression from at least gcc 2.95 and gcc 3.1, as demonstraded on the projects/optimize.html page.
Confirmed, the problem is that DOM does: D.1192 = (unsigned int) i; D.1194 = (float * restrict) D.1192 * 4B; *(aa2 + D.1194) = *(a2 + D.1194); *(bb2 + D.1194) = *(b2 + D.1194); Note how we use D.1194 in all three places. for PPC this is the correct thing to do but not for x86 which has three operands loads.
Though I should note that PPC is much better on the mainline than before: gcc 4.0.0: L4: lfsx f0,r3,r2 stfsx f0,r5,r2 lfsx f13,r4,r2 stfsx f13,r6,r2 addi r2,r2,4 bdnz L4 gcc 3.3 (Apple's): L9: slwi r7,r11,2 addi r11,r11,1 lfsx f0,r7,r3 stfsx f0,r7,r5 lfsx f1,r7,r4 stfsx f1,r7,r6 bdnz L9 So really this is a target specific bug :). Also here the loop for x86_64: .L4: movl (%rdx,%r10), %eax incl %ecx movl %eax, (%rdx,%r9) movl (%rdx,%rdi), %eax movl %eax, (%rdx,%rsi) addq $4, %rdx cmpl %ecx, %r8d jg .L4 Note changing the type of n and i to be unsigned we get slightly better code: .L4: movl -16(%ebp), %ebx leal 0(,%ecx,4), %eax incl %ecx cmpl %ecx, 24(%ebp) movl (%ebx,%eax), %edx movl -20(%ebp), %ebx movl %edx, (%edi,%eax) movl (%esi,%eax), %edx movl %edx, (%ebx,%eax) jne .L4 So IV-OPTs is not doing its job correctly in one place.
At least x86 and ARM have {reg + reg OP const} addressing modes. Unfortunately we rip such expressions apart already in the gimplifier. This is something we canot fix properly on trees. TER could perhaps do it, but that pass should really go away itself, and we don't know anything about addressing modes on trees anyway. Looks like we need to teach an RTL loop optimizer about this...
For PPC at least IV-OPTS should note that we have post increment and decrement the pointers before the loop and then increment all of them inside the loop, aka: void fcpy(float *restrict a, float *restrict b, float *restrict aa, float *restrict bb, unsigned n) { unsigned i; aa-=1; a-=1; bb-=1; b-=1; for(i = 0; i < n; i++) { aa+=1; a+=1; bb+=1; b+=1; *bb=*b; *aa=*a; } } So we get: L4: lfsu f0,4(r4) lfsu f13,4(r3) stfsu f0,4(r6) stfsu f13,4(r5) bdnz L4 which is the most optimal for PPC
Here is the reduced testcase for the problem, it has nothing to do with loops at all: void fcpy(float *restrict a, float *restrict b, float *restrict aa, float *restrict bb, unsigned n) { aa[n]=a[n]; bb[n]=b[n]; } DOM is doing CSE of n*4 which is the right thing to do.
This is a RTL problem as it works correctly on ARM which has it ... I should note that arm's instruction has nothing special in its .md file: (define_insn "*arm_movsf_soft_insn" [(set (match_operand:SF 0 "nonimmediate_operand" "=r,r,m") (match_operand:SF 1 "general_operand" "r,mE,r"))] "TARGET_ARM
CSE is trying to reconstruct the addressing mode, but it has (plus:SI (ashift:SI (reg/v:SI 61 [ n ]) (const_int 2 [0x2])) (reg/v/f:SI 59 [ a ])) According to hp, the canonical form would be with a mult. Breakpoint 9, ix86_decompose_address (addr=0x401d515c, out=0xbfffec48) at ../../mainline/gcc/config/i386/i386.c:4567 4567 { 1: debug_rtx (addr) = (plus:SI (ashift:SI (reg/v:SI 61 [ n ]) (const_int 2 [0x2])) (reg/v/f:SI 59 [ a ])) void (gdb) bt #0 ix86_decompose_address (addr=0x401d5078, out=0xbfffec78) at ../../mainline/gcc/config/i386/i386.c:4567 #1 0x082ee548 in legitimate_address_p (mode=3221220472, addr=0x401d5078, strict=0) at ../../mainline/gcc/config/i386/i386.c:5064 #2 0x0828ec5e in memory_address_p (mode=3221220472, addr=0xbfffec78) at ../../mainline/gcc/recog.c:1268 #3 0x0828ed5f in general_operand (op=0x401d5084, mode=SFmode) at ../../ mainline/gcc/recog.c:979 #4 0x0828f029 in nonimmediate_operand (op=0x401d5084, mode=3221220472) at ../../mainline/gcc/recog.c:1143 #5 0x08226803 in recog_16 (x0=0x401d50b4, insn=0x4016c230, pnum_clobbers=0x0) at insn-recog.c:16332 #6 0x0828e07c in recog_memoized_1 (insn=0x4016c230) at ../../mainline/gcc/ recog.c:123 #7 0x08121594 in cse_insn (insn=0x4016c230, libcall_insn=0x0) at ../../ mainline/gcc/cse.c:4783
Created attachment 7541 [details] Patch to cse to canon the rtx for addresses I have not bootstrapped this at all but I thought I would attach it here.
Subject: Bug 18463 CVSROOT: /cvs/gcc Module name: gcc Changes by: pinskia@gcc.gnu.org 2004-11-25 23:10:27 Modified files: gcc : ChangeLog cse.c Log message: 2004-11-25 Andrew Pinski <pinskia@physics.uc.edu> parts of PR rtl-opt/18463, rtl-opt/17647 * cse.c (canon_for_address): New function. (find_best_addr): Call canon_for_address before getting the address's cost when checking if we should take that address. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.6555&r2=2.6556 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/cse.c.diff?cvsroot=gcc&r1=1.325&r2=1.326
This is mostly a iv-opts problem. But note we still don't get the most optimal code with -fno-ivopts: .L4: movl 8(%ebp), %ebx movl (%ebx,%edx,4), %eax movl 20(%ebp), %ebx movl %eax, (%esi,%edx,4) movl (%edi,%edx,4), %eax movl %eax, (%ebx,%edx,4) incl %edx cmpl %edx, %ecx jg .L4 But that is because of we are pulling in the load from the agruments into the loop (that is a different bug but I think I should mark that as a regression). We still get the same asm as given in comment #0 with -fivopts still on.
PR 18137 is the one which is about reload fucking up and pull the load of the arguments into the loop.
Actually I missed that you have to use -fomit-frame-pointer, so this is not related to PR 18137 after all.
The problem indeed is ivopts - dom interaction. Ivopts decide that since reg + 4 * reg is a cheap addressing mode, there is no reason to do anything else than what it does. To cure this we need to be able to allow ivopts to express more clearly that it does not want an expression to be played with; I think the best solution is to have a tree code that would map directly to the memory access (including the addressing mode). I am working on the patch.
I have the (experimental) patch for addressing mode selection on trees (http://atrey.karlin.mff.cuni.cz/~rakdver/diff_lower_address.diff). It indeed helps; we get i = 0; <L0>:; mem[aa + 4B * i]{*D.1047} = mem[a + 4B * i]{*D.1048}; mem[bb + 4B * i]{*D.1050} = mem[b + 4B * i]{*D.1051}; i = (int) ((unsigned int) i + 1); if (n > i) goto <L0>; else goto <L2>; <L2>:; in .vars dump and .L4: movl (%ebp,%edx,4), %eax movl %eax, (%esi,%edx,4) movl (%edi,%edx,4), %eax movl %eax, (%ebx,%edx,4) incl %edx cmpl %edx, %ecx jg .L4 in the assembler, which seems fine (except that the memory references are not reordered; maybe some of the aliasing information gets lost due to the patch currently). The patch definitely won't make it for 4.0, but I would like to get it or something similar to 4.1.
While the patch looks great to me, it is not feasable as you said for 4.0. Since this is a 4.0 regression, we should probably look for a way to fix this problem in a less intrusive (even if not totally correct) way on 4.0. Do you have any idea?
*** Bug 17647 has been marked as a duplicate of this bug. ***
*** Bug 19680 has been marked as a duplicate of this bug. ***
This problem is bigger than just the int<-->fp problem of the original bug report. It's basically the same problem, namely a poor choice of addressing modes.
Some recent discussion about related symptoms. http://gcc.gnu.org/ml/gcc/2005-01/msg01667.html
The up-to-date version of the patch that fixes the problem: http://gcc.gnu.org/ml/gcc-patches/2005-04/msg01360.html
On the mainline we get now: .L4: movl (%ebp,%edx,4), %eax movl %eax, (%ebx,%edx,4) movl (%esi,%edx,4), %eax movl %eax, (%ecx,%edx,4) incl %edx cmpl %edx, %edi jne .L4 Which is better but the two loads/stores are not scheduled.
Subject: Re: [4.0/4.1 Regression] suboptimal use of fancy x86 addressing modes > > ------- Additional Comments From pinskia at gcc dot gnu dot org 2005-06-13 12:46 ------- > On the mainline we get now: > .L4: > movl (%ebp,%edx,4), %eax > movl %eax, (%ebx,%edx,4) > movl (%esi,%edx,4), %eax > movl %eax, (%ecx,%edx,4) > incl %edx > cmpl %edx, %edi > jne .L4 > > Which is better but the two loads/stores are not scheduled. -frename-registers? But hardware will do that for you anyway. Honza > > -- > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18463 > > ------- You are receiving this mail because: ------- > You are on the CC list for the bug, or are watching someone who is.
This is what we get one the mainline: .L4: movl (%ecx), %eax addl $4, %ecx movl %eax, (%edi,%edx,4) movl (%ebp,%edx,4), %eax movl %eax, (%esi,%edx,4) incl %edx cmpl %edx, %ebx jne .L4 Note the code in comment #4 has a target patch which improves this a little further: Index: i386.c =============================================================== ==== RCS file: /cvs/gcc/gcc/gcc/config/i386/i386.c,v retrieving revision 1.858 diff -u -p -r1.858 i386.c --- i386.c 6 Sep 2005 19:57:46 -0000 1.858 +++ i386.c 13 Sep 2005 18:49:44 -0000 @@ -5273,6 +5273,10 @@ ix86_address_cost (rtx x) /* More complex memory references are better. */ if (parts.disp && parts.disp != const0_rtx) cost--; + + if (parts.scale != 1) + cost--; + if (parts.seg != SEG_DEFAULT) cost--; But since I don't have SPEC, I have not submitted the patch. Steven could you test this patch and submit it for me? ChangeLog (please make a better changelog): (ix86_address_cost): More complex is cheaper than anything else.
Comment on attachment 7541 [details] Patch to cse to canon the rtx for addresses Already in mainline
Auch; that patch is actually a very bad idea. Pretending that complex addressing modes are cheaper, when they are not, is just confusing. If there are optimizers that indeed want to prefer complex addressing modes (like cse and combine, I suppose), they should reflect that in the cost function they are using, rather than forcing this missconception to every other optimizer.
It looks like it is just following existing practice?
Subject: Re: [4.0 Regression] suboptimal use of fancy x86 addressing modes > It looks like it is just following existing practice? yes, I know. The practice is just wrong, though.
Another patch that improves the code on i686: http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01159.html (with -O2 -fomit-frame-pointer:) .L4: movl (%ebp,%edx,4), %eax movl %eax, (%esi,%edx,4) movl (%edi,%edx,4), %eax movl %eax, (%ebx,%edx,4) incl %edx cmpl %edx, %ecx jne .L4
Subject: Bug 18463 CVSROOT: /cvs/gcc Module name: gcc Changes by: rakdver@gcc.gnu.org 2005-09-20 07:09:22 Modified files: gcc : ChangeLog tree-chrec.c tree-chrec.h tree-scalar-evolution.c Log message: PR tree-optimization/18463 * tree-chrec.c (chrec_convert): Return fold_converted chrec if converting it directly is not possible. (chrec_convert_aggressive): New function. * tree-chrec.h (chrec_convert_aggressive): Declare. * tree-scalar-evolution.c (instantiate_parameters_1, resolve_mixers): Fold chrec conversions aggressively if asked to. (instantiate_parameters): Modified because of changes in instantiate_parameters_1. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.9985&r2=2.9986 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-chrec.c.diff?cvsroot=gcc&r1=2.27&r2=2.28 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-chrec.h.diff?cvsroot=gcc&r1=2.10&r2=2.11 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-scalar-evolution.c.diff?cvsroot=gcc&r1=2.37&r2=2.38
Fixed in GCC-4.1.0. Not to be fixed in GCC-4.0.x