Bug 37364 - [4.4 Regression] IRA generates inefficient code due to missing regmove pass
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.4.0
Importance: P2 normal
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Duplicates: 38601
Depends on:
Blocks: 37243 37437 37948
Reported: 2008-09-04 05:44 UTC by H.J. Lu
Modified: 2009-02-04 07:57 UTC
CC List: 9 users

See Also:
Host:
Target: x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-09-06 15:57:14


Attachments
A patch (919 bytes, patch)
2008-09-04 20:30 UTC, H.J. Lu
Reduced performance case from cpu2006/454.calculix (446 bytes, text/plain)
2008-10-24 08:36 UTC, Joey Ye
A patch to re-enable regmove (314 bytes, patch)
2008-10-28 01:12 UTC, H.J. Lu

Description H.J. Lu 2008-09-04 05:44:32 UTC
bash-3.2$ cat /tmp/x.c
#include <mmintrin.h>
extern __m64 x, y;
unsigned long long  foo(__m64 m) {
  return _mm_cvtm64_si64(_mm_add_pi32(x, y));
}
bash-3.2$ ./xgcc -B./ -march=core2 -S -O2 /tmp/x.c
bash-3.2$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
.globl foo
	.type	foo, @function
foo:
.LFB129:
	.cfi_startproc
	movq	x(%rip), %mm0
	paddd	y(%rip), %mm0
	movq	%mm0, -8(%rsp)
	movq	-8(%rsp), %rax
	ret
	.cfi_endproc
.LFE129:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.4.0 20080903 (experimental) [trunk revision 139952]"
	.section	.note.GNU-stack,"",@progbits
bash-3.2$ ./xgcc -B./ -march=core2 -S -O2 /tmp/x.c -fno-ira
bash-3.2$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
.globl foo
	.type	foo, @function
foo:
.LFB129:
	.cfi_startproc
	movq	x(%rip), %mm0
	paddd	y(%rip), %mm0
	movd	%mm0, %rax
	ret
	.cfi_endproc
.LFE129:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.4.0 20080903 (experimental) [trunk revision 139952]"
	.section	.note.GNU-stack,"",@progbits
bash-3.2$ 

The problem is that IRA turns

(insn 8 7 14 2 ../../include/mmintrin.h:300 (set (reg:V2SI 61)
        (plus:V2SI (reg:V2SI 63 [ x ])
            (mem/c/i:V2SI (symbol_ref:DI ("y") <var_decl 0x7fd36e03cc80 y>) [2 y+0 S8 A64]))) 992 {*mmx_addv2si3} (expr_list:REG_DEAD (reg:V2SI 63 [ x ])
        (nil)))

(insn 14 8 20 2 /tmp/x.c:12 (set (reg/i:DI 0 ax)
        (subreg:DI (reg:V2SI 61) 0)) 89 {*movdi_1_rex64} (expr_list:REG_DEAD (reg:V2SI 61)
        (nil)))

into

(insn 26 8 14 2 ../../include/mmintrin.h:300 (set (mem/c:V2SI (plus:DI (reg/f:DI 7 sp)
                (const_int -8 [0xfffffffffffffff8])) [3 S8 A64])
        (reg:V2SI 29 mm0 [orig:63 x ] [63])) 946 {*movv2si_internal_rex64} (nil))

(insn:HI 14 26 20 2 /tmp/x.c:12 (set (reg/i:DI 0 ax)
        (mem/c:DI (plus:DI (reg/f:DI 7 sp)
                (const_int -8 [0xfffffffffffffff8])) [3 S8 A64])) 89 {*movdi_1_rex64} (nil))

while the old RA generates

(insn:HI 14 8 20 2 /tmp/x.c:12 (set (reg/i:DI 0 ax)
        (reg:DI 29 mm0 [orig:63 x ] [63])) 89 {*movdi_1_rex64} (nil))

The outputs from the regmove pass are different. With IRA, we got

(insn:HI 8 7 14 2 ../../include/mmintrin.h:300 (set (reg:V2SI 61)
        (plus:V2SI (reg:V2SI 63 [ x ])
            (mem/c/i:V2SI (symbol_ref:DI ("y") <var_decl 0x7f66abfb5c80 y>) [2 y+0 S8 A64]))) 992 {*mmx_addv2si3} (expr_list:REG_DEAD (reg:V2SI 63 [ x ])
        (nil)))

(insn:HI 14 8 20 2 /tmp/x.c:12 (set (reg/i:DI 0 ax)
        (subreg:DI (reg:V2SI 61) 0)) 89 {*movdi_1_rex64} (expr_list:REG_DEAD (reg:V2SI 61)
        (nil)))

Without IRA, we got

(insn:HI 8 7 14 2 ../../include/mmintrin.h:300 (set (reg:V2SI 63 [ x ])
        (plus:V2SI (reg:V2SI 63 [ x ])
            (mem/c/i:V2SI (symbol_ref:DI ("y") <var_decl 0x7fd36e03cc80 y>) [2 y+0 S8 A64]))) 992 {*mmx_addv2si3} (nil))

(insn:HI 14 8 20 2 /tmp/x.c:12 (set (reg/i:DI 0 ax)
        (subreg:DI (reg:V2SI 63 [ x ]) 0)) 89 {*movdi_1_rex64} (expr_list:REG_DEAD (reg:V2SI 63 [ x ])
        (nil)))

Does this regmove difference have any impact on code generation?
Comment 1 H.J. Lu 2008-09-04 16:02:34 UTC
"-O2 -march=core2 -fno-ira -fno-regmove" generates

	movq	x(%rip), %mm0
	paddd	y(%rip), %mm0
	movq	%mm0, -8(%rsp)
	movq	-8(%rsp), %rax

It seems that regmove isn't effective for IRA.
Comment 2 H.J. Lu 2008-09-04 16:13:15 UTC
(In reply to comment #1)
> "-O2 -march=core2 -fno-ira -fno-regmove" generates
> 
>         movq    x(%rip), %mm0
>         paddd   y(%rip), %mm0
>         movq    %mm0, -8(%rsp)
>         movq    -8(%rsp), %rax
> 
> It seems that regmove isn't effective for IRA.
> 

The output-replacement part of regmove is turned off for IRA. Reverting
the regmove.c change in revision 139590:

        * regmove.c (regmove_optimize): Don't do replacement of output for
        IRA.

fixes this regression. Vladimir, IRA should either deal with replacement
of output or let regmove handle it.
Comment 3 H.J. Lu 2008-09-04 17:43:39 UTC
The problem may be in IRA_COVER_CLASSES. -mtune=core2 turns on
TARGET_INTER_UNIT_MOVES, which means moves between MMX and 64-bit
integer registers are cheaper than a load/store. But IRA doesn't
handle this properly.
Comment 4 H.J. Lu 2008-09-04 17:54:18 UTC
(In reply to comment #3)
> The problem may be in IRA_COVER_CLASSES. -mtune=core2 turns on
> TARGET_INTER_UNIT_MOVES, which means moves between MMX and 64-bit
> integer registers are cheaper than a load/store. But IRA doesn't
> handle this properly.
> 

Is there a way to tell IRA that moving between two register
classes is cheaper than a load/store?
Comment 5 H.J. Lu 2008-09-04 18:39:06 UTC
Just for the record, moves between MMX and GPR registers aren't a major
issue since we prefer SSE anyway. If this is the only inter-class register
move problem for IRA, I am OK with closing it as WONTFIX.
Comment 6 Vladimir Makarov 2008-09-04 19:25:16 UTC
  First of all, I've checked the generated code on Core2 and found that it is not slower than using movd.

  IRA assigns hard registers by calculating their costs.  If memory is cheaper, it assigns memory.  The first decision between memory and a hard register is made in ira-costs.c.  The second decision point is in ira-color.c, because costs can change dynamically (e.g. when cheap hard registers are not available).

  In this case IRA decides to use memory instead of an MMX register in ira-costs.c:

  a0(r61,l0) costs: AREG:25000,25000 DREG:25000,25000 CREG:25000,25000 BREG:25000,25000 SIREG:25000,25000 DIREG:25000,25000 AD_REGS:25000,25000 Q_REGS:25000,25000 NON_Q_REGS:25000,25000 LEGACY_REGS:25000,25000 GENERAL_REGS:25000,25000 SSE_FIRST_REG:42000,42000 SSE_REGS:42000,42000 MMX_REGS:25000,25000 MEM:23000

  Memory is cheaper than MMX_REGS, therefore r61 gets NO_REGS as its cover class, which means memory is used.

  The reason for this is in insn

(insn:HI 14 8 20 2 /home/cygnus/vmakarov/build/ira-merge-branch/gcc/gcc/testsuite/gcc.target/i386/pr34256.c:12 (set (reg/i:DI 0 ax)
        (subreg:DI (reg:V2SI 61) 0)) 89 {*movdi_1_rex64} (expr_list:REG_DEAD (reg:V2SI 61)
        (nil)))

which has the following description

(define_insn "*movdi_1_rex64"
  [(set (match_operand:DI 0 "nonimmediate_operand"
	  "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
	(match_operand:DI 1 "general_operand"
	  "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
  "TARGET_64BIT && !(MEM_P (operands[0]) && MEM_P (operands[1]))"

Please look at the 8th alternative, "?r *Ym", which corresponds to GENERAL_REGS and MMX_REGS.  "?" makes the alternative expensive; "*" is even worse, because it excludes the alternative from the register cost calculation.
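
To illustrate, here is a minimal sketch of how the modifiers read (a hypothetical pattern, not one from i386.md):

(define_insn "*example_move"
  ;; Alternative 0 (r <- r) participates normally in the register class
  ;; preferencing done in ira-costs.c.
  ;; Alternative 1 (?*y <- r): "?" slightly disparages the alternative,
  ;; and "*" makes the cost/preferencing pass skip it entirely, so a
  ;; pseudo appearing only in such moves never gets MMX_REGS as its
  ;; preferred class; reload may still pick this alternative.
  [(set (match_operand:DI 0 "register_operand" "=r,?*y")
        (match_operand:DI 1 "register_operand" "r,r"))]
  "TARGET_64BIT"
  "@
   mov{q}\t{%1, %0|%0, %1}
   movq\t{%1, %0|%0, %1}")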

  The old register allocator would behave the same way if regmove had not coalesced r61 with another pseudo, making the resulting pseudo's MMX cost cheaper than memory.

  There are several ways to fix the problem:

o ignore * and ? in the cost calculation
o fix the pattern
o run regmove, as the old RA did
o mark the test failure as expected (XFAIL)

The first solution would result in a huge performance regression for practically any program.

I cannot speak to the 2nd solution because I am not the port maintainer.

The third solution is bad because, in the general case, IRA does coalescing more smartly on the fly; besides, running regmove could make RA even slower.

So I'd prefer the last solution.
Comment 7 H.J. Lu 2008-09-04 19:45:20 UTC
(In reply to comment #6)
>   First of all, I've checked the generated code on Core2 and found that it
> is not slower than using movd.

I think MMX may be slow anyway.

>   The reason for this is in insn
> 
> (insn:HI 14 8 20 2
> /home/cygnus/vmakarov/build/ira-merge-branch/gcc/gcc/testsuite/gcc.target/i386/pr34256.c:12
> (set (reg/i:DI 0 ax)
>         (subreg:DI (reg:V2SI 61) 0)) 89 {*movdi_1_rex64} (expr_list:REG_DEAD
> (reg:V2SI 61)
>         (nil)))
> 
> which has the following description
> 
> (define_insn "*movdi_1_rex64"
>   [(set (match_operand:DI 0 "nonimmediate_operand"
>           "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
>         (match_operand:DI 1 "general_operand"
>           "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
>   "TARGET_64BIT && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
> 
> Please look at the 8th alternative, "?r *Ym", which corresponds to
> GENERAL_REGS and MMX_REGS.  "?" makes the alternative expensive; "*" is even
> worse, because it excludes the alternative from the register cost calculation.

You are right.

>   The old register allocator would behave the same way if regmove had not
> coalesced r61 with another pseudo, making the resulting pseudo's MMX cost
> cheaper than memory.
> 
>   There are several ways to fix the problem:
> 
> o ignore * and ? in the cost calculation
> o fix the pattern
> o run regmove, as the old RA did
> o mark the test failure as expected (XFAIL)
> 
> So I'd prefer the last solution.

I agree, given that it only affects MMX. Uros, what do you think?
Should we mark gcc.target/i386/pr34256.c as XFAIL?
Comment 8 Richard Biener 2008-09-04 20:12:49 UTC
If we decide the test is bogus we should remove it, not xfail it.
Comment 9 H.J. Lu 2008-09-04 20:17:29 UTC
I am concerned about those "*Yi"/"*Ym" and "r" pairs:

[hjl@gnu-6 i386]$ grep "\*Yi" *.md
i386.md:			"=r,m ,*y,*y,?rm,?*y,*x,*x,?r ,m ,?*Yi,*x")
i386.md:			"g ,ri,C ,*y,*y ,rm ,C ,*x,*Yi,*x,r   ,m "))]
i386.md:	  "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
i386.md:	  "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
i386.md:  [(set (match_operand:DI 0 "nonimmediate_operand" "=r,?r,?o,?*Ym,?*y,?*Yi,*Y2")
i386.md:  [(set (match_operand:DI 0 "nonimmediate_operand" "=r,o,?*Ym,?*y,?*Yi,*Y2")
[hjl@gnu-6 i386]$ grep "\*Ym" *.md 
i386.md:	  "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
i386.md:	  "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
i386.md:	  "=f,m,f,r  ,m ,x,x,x ,m,!*y,!m,!*y,?Yi,?r,!*Ym,!r")
i386.md:	  "fm,f,G,rmF,Fr,C,x,xm,x,m  ,*y,*y ,r  ,Yi,r   ,*Ym"))]
i386.md:  [(set (match_operand:DI 0 "nonimmediate_operand" "=r,?r,?o,?*Ym,?*y,?*Yi,*Y2")
i386.md:  [(set (match_operand:DI 0 "nonimmediate_operand" "=r,o,?*Ym,?*y,?*Yi,*Y2")
[hjl@gnu-6 i386]$ 

IRA is much more sensitive to the "*" constraint. We can ignore the "*Ym"/"r"
pairs. But should we remove "*" from the "*Yi"/"r" pairs?
Comment 10 H.J. Lu 2008-09-04 20:30:52 UTC
Created attachment 16224 [details]
A patch

I am testing this patch

2008-09-04  H.J. Lu  <hongjiu.lu@intel.com>

        PR target/37364
        * config/i386/i386.md (*movsi_1): Remove * from *Yi.
        (*movdi_1_rex64): Remove * from pairs of r and *Ym/*Yi.
        (zero_extendsidi2_32): Likewise.
        (zero_extendsidi2_rex64): Likewise.
        (*movsf_1): Remove * from *Ym.
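
For reference, the *movdi_1_rex64 part of the change presumably amounts to
something like the following sketch; the actual 919-byte attachment is
authoritative:

-	  "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
+	  "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?Ym ,?*y,*x,*x,?r ,m,?Yi ,*x,?*x,?*Ym")
-	  "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
+	  "Z ,rem,i,re,n ,C ,*y,Ym ,*y,r   ,m  ,C ,*x,Yi ,*x,r  ,m ,*Ym,*x"))]

That is, "*" is dropped only from the Ym/Yi constraints paired with "r" in
the same alternative; the MMX<->SSE alternatives keep their "*".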
Comment 11 Uroš Bizjak 2008-09-06 15:57:14 UTC
(In reply to comment #9)
> I am concerned about those "*Yi"/"*Ym" and "r" pairs:

> IRA is much more sensitive to the "*" constraint. We can ignore the
> "*Ym"/"r" pairs. But should we remove "*" from the "*Yi"/"r" pairs?

This approach was used to keep the register allocator from moving a value from an integer ("r") register to a non-integer register that can otherwise hold an SImode or DImode value. This happened when there was a shortage of "r" registers.

This delicate balance was achieved by using "?" and "*" for MMX and SSE registers in various move patterns. I'm afraid that removing these decorations could (in the worst case) lead to MMX instructions that lock out x87 registers.

IMO, running a simple regmove pass avoids this situation, since that pass just connects two already-used registers together. Due to the "*" decoration, the allocator won't allocate such a register in the move pattern, no matter how high the register pressure is.

So, due to this, I vote to bring back the regmove pass to fix this issue independently of the RA. Maybe this pass can even be enhanced a bit to fix PR 19389?
Comment 12 Uroš Bizjak 2008-09-06 16:25:08 UTC
(In reply to comment #11)

> of the RA. Maybe this pass can even be enhanced a bit to fix PR 19389?

Eh, PR 19398.

Comment 13 H.J. Lu 2008-09-06 16:49:22 UTC
Vladimir, I will re-enable regmove on the ira-merge branch to see its
impact on compile time as well as on SPEC CPU 2K/2006.
Comment 14 H.J. Lu 2008-09-09 16:01:16 UTC
I re-enabled the regmove pass on the ira-merge branch at revision 140065
and ran SPEC CPU 2K/2006 with -O2 -msse2 -mfpmath=sse -ffast-math for
both 32bit and 64bit on Intel Core 2. Here are the performance impacts
of the regmove pass:

32bit:

164.gzip 			 -0.597213%
175.vpr 			 0.42898%
176.gcc 			 2.66576%
181.mcf 			 -0.913938%
186.crafty 			 -0.514019%
197.parser 			 0.185644%
252.eon 			 1.10181%
253.perlbmk 			 -2.17957%
254.gap 			 -0.993521%
255.vortex 			 0.209644%
256.bzip2 			 -0.294551%
300.twolf 			 0.131363%
SPECint_base2000 			 -0.0744109%

168.wupwise 			 0.45045%
171.swim 			 0.653595%
172.mgrid 			 4.23729%
173.applu 			 1.31265%
177.mesa 			 -2.31596%
178.galgel 			 -0.410633%
179.art 			 0.504994%
183.equake 			 -0.744153%
187.facerec 			 4.28781%
188.ammp 			 1.28707%
189.lucas 			 13.7548%
191.fma3d 			 -0.53528%
200.sixtrack 			 1.49551%
301.apsi 			 0.27088%
SPECfp_base2000 			 1.66845%

400.perlbench 			 1.52284%
401.bzip2 			 -1.36986%
403.gcc 			 0.497512%
429.mcf 			 0%
445.gobmk 			 -1.08696%
456.hmmer 			 0%
458.sjeng 			 0%
462.libquantum 			 -0.490196%
464.h264ref 			 -20.1581%
471.omnetpp 			 0.689655%
473.astar 			 0.826446%
483.xalancbmk 			 0%
SPECint(R)_base2006 			 -2.1978%

410.bwaves 			 0%
416.gamess 			 1.37931%
433.milc 			 0%
434.zeusmp 			 -3.0303%
435.gromacs 			 -6.19469%
436.cactusADM 			 -8.57143%
437.leslie3d 			 0%
444.namd 			 1.5625%
447.dealII 			 -0.446429%
450.soplex 			 0%
453.povray 			 1.26582%
454.calculix 			 1.59884%
459.GemsFDTD 			 1.76991%
465.tonto 			 -0.793651%
470.lbm 			 -5.64103%
481.wrf 			 0.917431%
482.sphinx3 			 4.69484%
SPECfp(R)_base2006 			 -1.41844%

64bit:

164.gzip 			 -0.124611%
175.vpr 			 -0.145208%
176.gcc 			 0.172831%
181.mcf 			 0.143833%
186.crafty 			 -0.934579%
197.parser 			 0.122549%
252.eon 			 -0.465839%
253.perlbmk 			 0.565348%
254.gap 			 -0.158983%
255.vortex 			 0.0648298%
256.bzip2 			 0.0863931%
300.twolf 			 0.346566%
SPECint_base2000 			 -0.0274812%

168.wupwise 			 1.79766%
171.swim 			 0.389105%
172.mgrid 			 -1.01574%
173.applu 			 4.83871%
177.mesa 			 -0.634026%
178.galgel 			 0.239704%
179.art 			 0.71311%
183.equake 			 1.14286%
187.facerec 			 2.05011%
188.ammp 			 -1.99076%
189.lucas 			 2.137%
191.fma3d 			 -0.20816%
200.sixtrack 			 0.303951%
301.apsi 			 0.0847099%
SPECfp_base2000 			 0.690346%

400.perlbench 			 -0.440529%
401.bzip2 			 -0.564972%
403.gcc 			 -0.5%
429.mcf 			 -0.510204%
445.gobmk 			 -1.05263%
456.hmmer 			 2.7027%
458.sjeng 			 1.5544%
462.libquantum 			 0.45045%
464.h264ref 			 0.374532%
471.omnetpp 			 -0.70922%
473.astar 			 0%
483.xalancbmk 			 0.507614%
SPECint(R)_base2006 			 0%

410.bwaves 			 0%
416.gamess 			 -0.568182%
433.milc 			 1.45985%
434.zeusmp 			 0%
435.gromacs 			 3.22581%
436.cactusADM 			 -6.66667%
437.leslie3d 			 0%
444.namd 			 0%
447.dealII 			 1.13636%
450.soplex 			 0.546448%
453.povray 			 -0.900901%
454.calculix 			 1.98598%
459.GemsFDTD 			 2.27273%
465.tonto 			 0%
470.lbm 			 -2.06186%
481.wrf 			 0.8%
482.sphinx3 			 0.881057%
SPECfp(R)_base2006 			 0%

At -O2, the impact of the regmove pass is mixed. However, when
-mtune=core2 is used, the results could be different.
See PR 37437.
Comment 15 Paolo Bonzini 2008-09-11 13:46:25 UTC
Uros, does your comment #11 apply also to SSE registers (Yi), which could help with PR 37437, or only to MMX (Ym)?
Comment 16 Uroš Bizjak 2008-09-11 17:33:10 UTC
(In reply to comment #15)
> Uros, does your comment #11 apply also to SSE registers (Yi), which could
> help with PR 37437, or only to MMX (Ym)?

Comment #11 is also true for Yi SSE registers (for TARGET_INTER_UNIT_MOVES targets, i.e. -march=core2). Under integer register shortage, the allocator will start to allocate SSE registers, and reload will later generate various xmm->reg moves to satisfy subsequent instruction constraints.

Comment 17 Joey Ye 2008-10-23 08:42:59 UTC
CPU2006/454.calculix has about a 10% regression with IRA + core2 + fpmath=sse on Core2 ix86:
                 IRA    IRA_core2   NO_IRA_core2
454.calculix     1.00   0.90        1.01

Revision: trunk 140514

Options in detail:
IRA= -m32 -O2 -mssse3 -mfpmath=sse
IRA_core2= $IRA -mtune=core2
NO_IRA_core2= $IRA_core2 -fno-ira
Comment 18 Joey Ye 2008-10-24 08:36:37 UTC
Created attachment 16536 [details]
Reduced performance case from cpu2006/454.calculix

50% regression with IRA core2 on trunk revisions 140514 and 141335

$ gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/configure --disable-bootstrap --enable-languages=c,c++,fortran --enable-checking=assert
Thread model: posix
gcc version 4.4.0 20081024 (experimental) [trunk revision 141335] (GCC) 
$ gcc -m32 -O2 -mssse3 -mfpmath=sse 36.c
$ time -p ./a.out
real 7.97
$ gcc -m32 -O2 -mssse3 -mfpmath=sse -mtune=core2 -o core2.exe 36.c
$ time -p ./core2.exe
real 12.27
$ gcc -m32 -O2 -mssse3 -mfpmath=sse -mtune=core2 -fno-ira -o no-ira.exe 36.c
$ time -p ./no-ira.exe
real 8.03
Comment 19 Paolo Bonzini 2008-10-24 10:11:10 UTC
Left = old, right = IRA.

It seems to me that the better register allocation of IRA gives the post-regalloc scheduling pass much less freedom.

Intel guys, could you run SPEC with -O2 -fschedule-insns and -O2, both of them using IRA?

.L4:
        movsd   (%esi,%eax,8), %xmm3                    movl    12(%ebp), %edx
        movsd   (%ebx,%eax,8), %xmm4                    movsd   (%edx,%eax,8), %xmm7
        movsd   (%ecx,%eax,8), %xmm6                    movl    -44(%ebp), %edx
        movl    12(%ebp), %edx                          movsd   %xmm7, -40(%ebp)
        movsd   (%edx,%eax,8), %xmm1                    movsd   (%edx,%eax,8), %xmm7
        movl    16(%ebp), %edx                          movsd   %xmm7, -56(%ebp)
        movapd  %xmm1, %xmm0                            movsd   -40(%ebp), %xmm7
        movsd   (%edx,%eax,8), %xmm2                    mulsd   (%ebx,%eax,8), %xmm7
        mulsd   %xmm3, %xmm0                            addsd   %xmm7, %xmm6
        movl    20(%ebp), %edx                          movsd   -40(%ebp), %xmm7
        addsd   -80(%ebp), %xmm0                        mulsd   (%esi,%eax,8), %xmm7
        movsd   (%edx,%eax,8), %xmm5                    addsd   %xmm7, %xmm5
        movsd   %xmm0, -80(%ebp)                        movsd   -40(%ebp), %xmm7
        incl    %eax                                    mulsd   (%edi,%eax,8), %xmm7
        movapd  %xmm1, %xmm0                            addsd   %xmm7, %xmm4
        cmpl    %eax, %edi                              movsd   -56(%ebp), %xmm7
        mulsd   %xmm4, %xmm0                            mulsd   (%ebx,%eax,8), %xmm7
        mulsd   %xmm6, %xmm1                            addsd   %xmm7, %xmm3
        addsd   -72(%ebp), %xmm0                        movsd   -56(%ebp), %xmm7
        addsd   -64(%ebp), %xmm1                        mulsd   (%esi,%eax,8), %xmm7
        movsd   %xmm0, -72(%ebp)                        addsd   %xmm7, %xmm2
        movsd   %xmm1, -64(%ebp)                        movsd   -56(%ebp), %xmm7
        movapd  %xmm2, %xmm0                            mulsd   (%edi,%eax,8), %xmm7
        mulsd   %xmm3, %xmm0                            addsd   %xmm7, %xmm1
        mulsd   %xmm5, %xmm3                            movsd   (%ecx,%eax,8), %xmm7
        addsd   -56(%ebp), %xmm0                        mulsd   (%ebx,%eax,8), %xmm7
        addsd   -32(%ebp), %xmm3                        addsd   -32(%ebp), %xmm7
        movsd   %xmm0, -56(%ebp)                        movsd   %xmm7, -32(%ebp)
        movsd   %xmm3, -32(%ebp)                        movsd   (%ecx,%eax,8), %xmm7
        movapd  %xmm2, %xmm0                            mulsd   (%esi,%eax,8), %xmm7
        mulsd   %xmm6, %xmm2                            addsd   -24(%ebp), %xmm7
        mulsd   %xmm4, %xmm0                            movsd   %xmm7, -24(%ebp)
        addsd   -40(%ebp), %xmm2                        movsd   (%ecx,%eax,8), %xmm7
        mulsd   %xmm5, %xmm4                            mulsd   (%edi,%eax,8), %xmm7
        addsd   -48(%ebp), %xmm0                        incl    %eax
        addsd   -24(%ebp), %xmm4                        addsd   %xmm7, %xmm0
        mulsd   %xmm6, %xmm5                            cmpl    %eax, 8(%ebp)
        movsd   %xmm0, -48(%ebp)                        jg      .L4
        movsd   %xmm2, -40(%ebp)
        movsd   %xmm4, -24(%ebp)
        addsd   %xmm5, %xmm7
        jg      .L4
Comment 20 Paolo Bonzini 2008-10-24 10:16:07 UTC
Why doesn't bugzilla have a preview?  Run the previous comment through

   sed -e '/, *$/\!b; N; y/\n/ /'

to get the side-by-side outputs.
Comment 21 Joey Ye 2008-10-25 04:14:00 UTC
To me the scheduler is irrelevant here. GCC has no core2 pipeline description, so the instruction scheduling doesn't look optimized. But for an OOO processor like core2, IMHO scheduling shouldn't make that much difference. Also, core2 + no-ira doesn't hurt, which means core2 scheduling is not the root cause.

Instead, the old code uses different registers for loading, but the IRA code always uses xmm7 as the load target. Two questions need to be figured out:
1. Why does the code from core2+ira run slower than plain ira?
2. Why does core2+ira generate such different code from non-core2?

Scheduler dump for core2:
;;      insn  code    bb   dep  prio  cost   reservation
;;      ----  ----    --   ---  ----  ----   -----------
;;      108    47     4     0     0     0   nothing     : 70 109 43
;;       43   102     4     1     0     0   nothing     : 70 51 117 114 67 109
;;      109    47     4     2     0     0   nothing     : 70 44
;;       44   102     4     1     0     0   nothing     : 70 57 55 59 67
;;       45   102     4     0     0     0   nothing     : 70 65 67 112 110
;;       46   102     4     0     0     0   nothing     : 70 55 49 67 61
;;      110   102     4     1     0     0   nothing     : 70 65 61
;;       61   720     4     2     0     0   nothing     : 70 55 62
;;       62   720     4     1     0     0   nothing     : 70 47 111
Comment 22 H.J. Lu 2008-10-28 01:12:05 UTC
Created attachment 16571 [details]
A patch to re-enable regmove

After applying this patch to re-enable regmove, I got

[hjl@gnu-6 gcc]$ ./xgcc -B./ -O2 -mtune=core2 /tmp/foo.c -o noira -fno-ira -m32
[hjl@gnu-6 gcc]$ ./xgcc -B./ -O2 -mtune=core2 /tmp/foo.c -o ira -m32
[hjl@gnu-6 gcc]$ time ./ira

real    0m7.307s
user    0m7.305s
sys     0m0.001s
[hjl@gnu-6 gcc]$ time ./noira

real    0m7.478s
user    0m7.474s
sys     0m0.002s
[hjl@gnu-6 gcc]$
Comment 23 Joey Ye 2008-10-28 01:19:50 UTC
(In reply to comment #22)
> Created an attachment (id=16571) [edit]
> A patch to re-enable regmove
> After applying this patch to re-enable regmove, I got
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -O2 -mtune=core2 /tmp/foo.c -o noira -fno-ira -m32
HJ, is your foo.c the case attached in comment #18?
Comment 24 H.J. Lu 2008-10-28 01:36:26 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > Created an attachment (id=16571) [edit]
> > A patch to re-enable regmove
> > After applying this patch to re-enable regmove, I got
> > [hjl@gnu-6 gcc]$ ./xgcc -B./ -O2 -mtune=core2 /tmp/foo.c -o noira -fno-ira -m32
> HJ, is your foo.c the case attached in comment #18?
> 

Yes.
Comment 25 H.J. Lu 2008-11-04 19:28:58 UTC
This regression isn't specific to -mtune=core2. I saw a 3% regression
with -O3 on Intel64. Enabling the full regmove pass fixes the regression.
Comment 26 Steven Bosscher 2008-11-30 20:45:19 UTC
Resurrecting regmove is not an option. Time is better spent figuring out what regmove does that makes a difference, and seeing if IRA can be taught to do the same.
Comment 27 H.J. Lu 2008-11-30 20:52:32 UTC
(In reply to comment #26)
> Resurrecting regmove is not an option. Time is better spent figuring out
> what regmove does that makes a difference, and seeing if IRA can be taught
> to do the same.
> 

x86 has

(define_insn "*movdi_1_rex64"
  [(set (match_operand:DI 0 "nonimmediate_operand"
          "=r,r  ,r,m ,!m,*y,*y,?r ,m ,?*Ym,?*y,*x,*x,?r ,m,?*Yi,*x,?*x,?*Ym")
        (match_operand:DI 1 "general_operand"
          "Z ,rem,i,re,n ,C ,*y,*Ym,*y,r   ,m  ,C ,*x,*Yi,*x,r  ,m ,*Ym,*x"))]
  "TARGET_64BIT && !(MEM_P (operands[0]) && MEM_P (operands[1]))"

The alternatives with * are available to regmove, but not to
IRA/reload. I don't know how you can resolve this without a separate
pass.
Comment 28 Steven Bosscher 2008-11-30 21:18:42 UTC
You're not explaining what regmove does. What transformation do these alternatives with "*" enable regmove to do?

I'm not saying that a separate pass is not an option. Perhaps a regmove-like pass is necessary.  In fact it probably is, I think even Vlad already acknowledged so. But not regmove as-we-know-it, which is a friggin' mess that ought to go away.

What we should figure out, IMHO, is which of the transformations regmove does are helpful here.  Maybe we can add a simple pre-regalloc pass that does those transformations, or make them part of an existing pass, using clean infrastructure (cfg, df) instead of the ad-hoc mess of regmove.


Comment 29 Steven Bosscher 2008-11-30 21:32:56 UTC
The two versions of insn 8 in comment #0 show the regmove transformation that matters here:

With regmove disabled:
(insn:HI 8 7 14 2 ../../include/mmintrin.h:300 (set (reg:V2SI 61)
        (plus:V2SI (reg:V2SI 63 [ x ])
            (mem/c/i:V2SI (symbol_ref:DI ("y") <var_decl 0x7f66abfb5c80 y>) [2 y+0 S8 A64]))) 992 {*mmx_addv2si3} (expr_list:REG_DEAD (reg:V2SI 63 [ x ])
        (nil)))


vs. with regmove enabled:
(insn:HI 8 7 14 2 ../../include/mmintrin.h:300 (set (reg:V2SI 63 [ x ])
        (plus:V2SI (reg:V2SI 63 [ x ])
            (mem/c/i:V2SI (symbol_ref:DI ("y") <var_decl 0x7fd36e03cc80 y>) [2 y+0 S8 A64]))) 992 {*mmx_addv2si3} (nil))


This is a "textbook" example of a regmove transformation (where the use of the word "textbook" is an insult to all textbooks ever written, since nothing in regmove is "textbook", but oh well...).  It's turned a 3-address instruction into a 2-address instruction, and this is precisely the transformation that regmove was originally written for.

In IRA, I would've expected reg 61 and reg 63 to be coalesced to give the same result.

Vlad already commented on this in comment #6.  It's clear from his description why IRA did not coalesce reg 61 and reg 63.  

Other compilers do this kind of transformation via reverse copy propagation.  GCC could perhaps add something like that too, when it transforms a 3-address insn to a 2-address insn.

Comment 30 Uroš Bizjak 2008-12-01 18:26:32 UTC
(In reply to comment #29)

> Other compilers do this kind of transformation via reverse copy propagation. 
> GCC could perhaps add something like that too, when it transforms a 3-address
> insn to a 2-address insn.

Will this also solve PR 19398, where we have:

        fstps  -4(%ebp)
(*)     movss  -4(%ebp), %xmm0
(*)     cvttss2si       %xmm0, %eax
        leave

Note that we could simply have:

        cvttss2si       -4(%ebp), %eax

Comment 31 H.J. Lu 2008-12-21 18:04:10 UTC
*** Bug 38601 has been marked as a duplicate of this bug. ***
Comment 32 H.J. Lu 2009-01-29 17:13:14 UTC
This has been fixed by revision 143757:

http://gcc.gnu.org/ml/gcc-patches/2009-01/msg01410.html
Comment 33 H.J. Lu 2009-01-29 17:57:26 UTC
Reopened, since revision 143757 isn't supposed to fix it.
Comment 34 Paolo Bonzini 2009-02-04 07:57:19 UTC
> Reopened, since revision 143757 isn't supposed to fix it.

Who cares if 143757 "isn't supposed to fix it", it *is* fixed:

	movq	x(%rip), %mm0
	paddd	y(%rip), %mm0
	movd	%mm0, %rax
	ret

Comment 35 Paolo Bonzini 2009-02-04 07:57:55 UTC
(and all bugs depending on this one are also fixed)