Bug 19653

Summary: x87 reg allocated for constants for -mfpmath=sse
Product: gcc
Component: target
Version: 4.0.0
Status: RESOLVED FIXED
Severity: minor
Priority: P2
Keywords: missed-optimization, patch, ra
Reporter: Uroš Bizjak <ubizjak>
Assignee: Paolo Bonzini <bonzini>
CC: bonzini, dalej, fjahanian, gcc-bugs, matz, pawel_sikora, rguenth, rhill
Target: i686-*-*
Target Milestone: ---
URL: http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01825.html
Last reconfirmed: 2005-07-27 15:53:05
Bug Blocks: 23111

Attachments:
  Self-contained example
  updated patch
  Re-updated patch (reload.c only)
  another patch update
  one more updated patch

Description Uroš Bizjak 2005-01-27 08:34:00 UTC
A self-contained testcase is attached to this bug report. Please compile it
with -O2 -march=pentium4 -mfpmath=sse -ffast-math.

This part of the testcase:
  ...
  if ( fabs(fabs(NewRay->Direction[Z])- 1.) < .1 ) {
    /* too close to vertical for comfort, so use cross product with horizon */
    up[X] = 0.; up[Y] = 1.; up[Z] = 0.;
  }
  else
  {
    up[X] = 0.; up[Y] = 0.; up[Z] = 1.;
  }

  VCross(n2, NewRay->Direction, up);  VNormalizeEq(n2);
  VCross(n3, NewRay->Direction, n2);  VNormalizeEq(n3);
  ...

will generate:
	jae	.L2
	fldz
	fstl	-112(%ebp)
	movsd	-112(%ebp), %xmm2
	fld1
	fstpl	-112(%ebp)
	movsd	-112(%ebp), %xmm1
.L5:
	fldl	32(%ebx)
	fstpl	-96(%ebp)
	movsd	-96(%ebp), %xmm3
	mulsd	%xmm2, %xmm3
	movapd	%xmm5, %xmm0
	mulsd	%xmm1, %xmm0
	subsd	%xmm0, %xmm3
	fldl	24(%ebx)
	fstpl	-88(%ebp)
	mulsd	-88(%ebp), %xmm2
	xorpd	.LC5, %xmm2
	movsd	-88(%ebp), %xmm4
        ...
.L2:
	fld1
	fstpl	-112(%ebp)
	movsd	-112(%ebp), %xmm2
	fldz
	fstl	-112(%ebp)
	movsd	-112(%ebp), %xmm1
	jmp	.L5

Uros.
Comment 1 Uroš Bizjak 2005-01-27 08:35:20 UTC
Created attachment 8080 [details]
Self-contained example

Comment 2 Uroš Bizjak 2005-01-27 09:14:22 UTC
A patch (actually an RFC) that was used to teach the allocator which register
set to use:
http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01783.html
[This patch probably doesn't work well with the xmmintrin.h stuff, as pointed
out by rth in a follow-up comment.]
Comment 3 Uroš Bizjak 2005-01-28 06:23:54 UTC
BTW: I don't think that x87 should be fully disabled for -mfpmath=sse. st(0)
can be used as temporary storage for memory-to-memory transfers. Also, it can
do on-the-fly FP extension and truncation without touching an SSE reg:

movsd    (%eax), %xmm1   # ~7 cycles
cvtsd2ss %xmm1, %xmm1    # ~14 cycles
movss    %xmm1, (%esp)


could be implemented by:

fldl  (%eax)  # ~7 cycles
fstps (%esp)  # ~7 cycles


There is nothing wrong if fldl is replaced with fldz or fld1. Performance
problems arise when the memory location is used in subsequent SSE
computations. In that case, it would be better if the zero were generated in
an SSE register and stored to memory from there.
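For illustration only (hand-written assembly, not output from any compiler;
the register number is taken from the listing above), the x87 round trip can
be replaced by the standard SSE2 zero idiom when the value feeds SSE code:

```asm
	# what is currently emitted: 0.0 bounced through x87 and memory
	fldz
	fstl	-112(%ebp)
	movsd	-112(%ebp), %xmm2

	# what would be preferable: generate 0.0 directly in the SSE register
	xorpd	%xmm2, %xmm2
```

For 1.0 there is no single-instruction SSE idiom, so a movsd load from the
constant pool would be the natural replacement there.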
Comment 4 Steven Bosscher 2005-02-08 10:13:13 UTC
rth hacked the constraints recently to have better ra for some fp cases.  Can
you see if the bug is still there today on mainline?
Comment 5 Uroš Bizjak 2005-02-09 07:30:30 UTC
(In reply to comment #4)
> rth hacked the constraints recently to have better ra for some fp cases.  Can
> you see if the bug is still there today on mainline?

gcc version 4.0.0 20050209 (experimental)
'-O2 -march=pentium4 -mfpmath=sse -ffast-math -D__NO_MATH_INLINES'

	...
	comisd	.LC2, %xmm0
	jb	.L2
	fld1                        <- 1.0 bounced through x87 to reach an SSE reg
	fstpl	-112(%ebp)
	movsd	-112(%ebp), %xmm2
	fldz                        <- likewise for 0.0
	fstl	-112(%ebp)
	movsd	-112(%ebp), %xmm1
	movapd	%xmm1, %xmm0
.L4:
	fldl	32(%ebx)               movsd 32(%ebx), %xmm3
	fstpl	-96(%ebp)              movsd %xmm3, -96(%ebp)
	movsd	-96(%ebp), %xmm3       [ -not needed- ]
	mulsd	%xmm2, %xmm3
	subsd	%xmm0, %xmm3
	fldl	24(%ebx)            <- move to temporary, this is OK
	fstpl	-88(%ebp)
	mulsd	-88(%ebp), %xmm2
	xorpd	.LC4, %xmm2
	movsd	-88(%ebp), %xmm4
	mulsd	%xmm1, %xmm4
	...

The bug is still there (it manifests itself also in a couple of other places).
Could somebody confirm this bug? A testcase is attached.
Comment 6 Andrew Pinski 2005-06-19 14:58:01 UTC
Confirmed.
Comment 7 Andrew Pinski 2005-07-13 11:30:17 UTC
*** Bug 22453 has been marked as a duplicate of this bug. ***
Comment 8 Paolo Bonzini 2005-07-13 11:51:15 UTC
Smaller testcase from PR22453:


extern double f(double, double);
void g (double x)
{
  double z, y;

  z = 0.0;
  y = 1.0 - x;

again:
  z = y - z;
  f(z, 1.0);
  if (z == 0.0)
    goto again;
}

has a fld1 instruction when compiled with "-mfpmath=sse -msse2 -msseregparm
-mtune=pentiumpro -O2".

This instruction is caused by a reload into a FLOAT_REGS register; moving
the value to an SSE register needs secondary memory.
Comment 9 Michael Matz 2005-07-13 13:55:02 UTC
I was going to add this text to PR22453, when I noticed that it was closed 
as duplicate to this one.  So putting it here for reference, although 
everything seems to be analyzed already: 
 
The reload happens because reg 58 gets no hardreg: it's live across a call,
and it's not worthwhile to put it into a call-clobbered reg (which all SSE
regs are).  So reg 58 is placed on the stack (at ebp+16).  Now this mem must
be initialized with 1.0.  Whether that is done via x87 (fld1, fst ebp+16), via
GENERAL_REGS (mov 1.0 -> (reg:DF ax), mov (reg:DF ax) -> (ebp+16)), or via
SSE_REGS (movsd (mem 1.0) -> xmm0, mov xmm0 -> (ebp+16)) is actually not
that important.  You won't get rid of this reload.
  
Except that _if_ you force it to use SSE_REGS, then the next reload from  
(ebp+16) for the next insn can be inherited (as it's then the same mode),  
hence the initial store to ebp+16 is useless and will be removed.  
  
This can be tested with this hack:  
--- i386.md     12 Jul 2005 09:20:12 -0000      1.645  
+++ i386.md     13 Jul 2005 13:47:06 -0000  
@@ -2417,9 +2417,9 @@  
  
 (define_insn "*movdf_nointeger"  
   [(set (match_operand:DF 0 "nonimmediate_operand"  
-                       "=f#Y,m  ,f#Y,*r  ,o  ,Y*x#f,Y*x#f,Y*x#f  ,m    ")  
+                       "=?f#Y,m  ,f#Y,*?r  ,o  ,Y*x#f,Y*x#f,Y*x#f  ,m    ")  
        (match_operand:DF 1 "general_operand"  
-                       "fm#Y,f#Y,G  ,*roF,F*r,C    ,Y*x#f,HmY*x#f,Y*x#f"))]  
+                       "?fm#Y,f#Y,G  ,*?roF,F*r,C    ,Y*x#f,HmY*x#f,Y*x#f"))]  
   "(GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)  
    && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)  
    && (reload_in_progress || reload_completed  
  
But I don't see immediately how reload could be convinced to do so
automatically, as the choice of the reload class for one insn is independent
of the choices of reloads for the same reg in other insns.
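A sketch of the inheritance effect described above (offsets, labels, and
register numbers are invented for illustration):

```asm
	# initialized via x87: the store is live, and every later use of
	# (ebp+16) needs a fresh load from memory
	fld1
	fstpl	16(%ebp)
	...
	movsd	16(%ebp), %xmm0		# reload, must go through memory

	# initialized via SSE_REGS: the reload below has the same mode as
	# %xmm0, so reload inheritance turns it into a register copy and
	# the initial store becomes dead
	movsd	.LC0, %xmm0
	movsd	%xmm0, 16(%ebp)		# removed once all reloads are inherited
	...
	movsd	16(%ebp), %xmm1		# inherited: becomes movapd %xmm0, %xmm1
```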
Comment 10 Paolo Bonzini 2005-07-13 14:08:31 UTC
http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01783.html crashes on Richard
Guenther's libsse2, FWIW.
Comment 11 Paolo Bonzini 2005-07-13 14:23:07 UTC
Michael, thank you very much.  Your analysis will probably help a lot.
Comment 12 Paolo Bonzini 2005-07-14 08:45:35 UTC
> But I don't see immediately how reload could be convinced to do so  
> automatically, as the choice of the reload class for one insn is independent
> from the choices of reloads for the same reg but in other insns.  

We can use PREFERRED_RELOAD_CLASS and PREFERRED_OUTPUT_RELOAD_CLASS.  I am not
sure if the fix is over-eager, but it works great on libsse2.  Every fld that
was there disappears, and from a cursory check, in current mainline's code all
of them could have been inherited.

Patches are at:
- http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00914.html (regclass)
- http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00983.html (reload+i386)

It is also necessary to remove the `#' regclass hints from the machine
description, with
  sed -i 's/#[^",\][^",]*\([",]\)/\1/g' i386.md
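For reference, this is what that sed command does to one constraint string
from the *movdf_nointeger pattern quoted elsewhere in this report (the input
string is copied from the pattern; the rest is just a demo):

```shell
# Strip regclass-only '#' hints such as '#Y' and '#f', keeping the
# delimiter (',' or '"') that follows them.
s='"fm#Y,f#Y,G  ,*roF,F*r,C    ,Y*x#f,HmY*x#f,Y*x#f"'
echo "$s" | sed 's/#[^",\][^",]*\([",]\)/\1/g'
# prints: "fm,f,G  ,*roF,F*r,C    ,Y*x,HmY*x,Y*x"
```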

Paolo
Comment 13 Paolo Bonzini 2005-07-27 15:56:10 UTC
Created attachment 9374 [details]
updated patch

updated patch, should fix perlbmk and twolf failures in previous versions of
the patch

the failures were reproducible on x86_64 with

  ./cc1 -O2 -fvisibility=hidden -quiet	write2.i -o -  -da -quiet -march=k8 
  -mfpmath=sse -m32

where write2.i is

  extern double f;

  void dbl (long long value)
  {
    double g = (double) (value);
    __asm__ (";;;" : : "t" (g) : "xmm0", "xmm1",
	     "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7");
    f = g + 2.0;
  }

(the asm is only there to force selection of the *floatdidf2_387 pattern)
Comment 14 Dale Johannesen 2005-08-01 20:56:45 UTC
Unfortunately the latest version of this patch causes a bootstrap failure on ppc:

../../gcc3.apple.200502/gcc/reload.c: In function 'find_reloads':
../../gcc3.apple.200502/gcc/reload.c:4512: internal compiler error: in
do_output_reload, at reload1.c:6936

which is

  /* If is a JUMP_INSN, we can't support output reloads yet.  */
  gcc_assert (!JUMP_P (insn));

Digging further.
Comment 15 Dale Johannesen 2005-08-02 22:55:37 UTC
Created attachment 9418 [details]
Re-updated patch (reload.c only)
Comment 16 Dale Johannesen 2005-08-02 22:57:53 UTC
The preceding patch fixes the ICE I was getting.  The tests following the
modified area in find_reloads were being skipped in cases where they weren't
before; in particular, the case where output reloads are not allowed went
undetected.  The revised patch moves the area Paolo modified below those
tests.  Going through more testing now.
Comment 17 Paolo Bonzini 2005-08-08 16:39:01 UTC
Created attachment 9445 [details]
another patch update

More problems... below this comment

	      /* Alternative loses if it requires a type of reload not
		 permitted for this insn.  We can always reload SCRATCH
		 and objects with a REG_UNUSED note.  */

there is an "else if" that must be changed to an "if".  Attaching the patch
that fixes this.
Comment 18 Paolo Bonzini 2005-08-09 07:40:24 UTC
Created attachment 9454 [details]
one more updated patch

Note that http://gcc.gnu.org/ml/gcc/2005-07/txt00007.txt should be applied
together with this patch, because each patch helps the other.
Comment 19 Paolo Bonzini 2005-08-22 22:03:08 UTC
SPEC results for i686-pc-linux-gnu follow.  The only significant regression is
in galgel; overall it's about 1% better for SPECint and 2% better for SPECfp.

Note that crafty improves a lot because of Dale's patch.

   Benchmark        -------- base --------     -------- peak --------
                    ref     time      ratio    ref     time      ratio
   164.gzip          1400     151         925*     1400     152         921*
   175.vpr           1400     163         859*     1400     162         862*
   176.gcc           1100      70.9      1552*     1100      71.1      1548*
   181.mcf           1800     176        1021*     1800     176        1020*
   186.crafty        1000     107         938*     1000     102         984*
   197.parser        1800     205         877*     1800     205         876*
   252.eon           1300     149         873*     1300     141         919*
   253.perlbmk       1800     126        1434*     1800     123        1464*
   254.gap           1100      82.1      1340*     1100      80.8      1361*
   255.vortex        1900     134        1415*     1900     134        1413*
   256.bzip2         1500     161         930*     1500     161         933*
   300.twolf         3000     235        1276*     3000     234        1281*
   SPECint_base2000                      1093
   SPECint2000                                                         1106

   168.wupwise       1600     174         920*     1600     173         926*
   171.swim          3100     180        1721*     3100     181        1713*
   172.mgrid         1800     257         700*     1800     257         701*
   173.applu         2100     200        1049*     2100     199        1056*
   177.mesa          1400     125        1116*     1400     123        1138*
   178.galgel        2900     305         952*     2900     312         930*
   179.art           2600     216        1206*     2600     212        1229*
   183.equake        1300      80.5      1615*     1300      76.1      1708*
   187.facerec       1900     286         664*     1900     286         664*
   188.ammp          2200     305         721*     2200     298         739*
   189.lucas         2000     225         889*     2000     203         983*
   191.fma3d         2100     397         530*     2100     373         563*
   200.sixtrack      1100     188         587*     1100     188         586*
   301.apsi          2600     262         991*     2600     261         996*
   SPECfp_base2000                        921
   SPECfp2000                                                           939

                                    NOTES
                                    -----
     Base flags: -O2 -msse -msse2 -mfpmath=sse
     Peak flags: -O2 -msse -msse2 -mfpmath=sse
Comment 20 Fariborz Jahanian 2005-09-20 22:34:24 UTC
(In reply to comment #18)
> Created an attachment (id=9454)
> one more updated patch
> 
> Note that http://gcc.gnu.org/ml/gcc/2005-07/txt00007.txt should be applied
> together with this patch, because each patch helps the other.

We have the entire patch in our Apple local branch. This patch causes a
performance degradation in a user benchmark, summarized in this test case:

typedef int __m64 __attribute__ ((__vector_size__ (8)));

static __inline __m64 __attribute__((__always_inline__))
_mm_add_si64 (__m64 __m1, __m64 __m2)
{
  return (__m64) __builtin_ia32_paddq ((long long)__m1, (long long)__m2);
}


__m64 unsigned_add3( const __m64 *a, const __m64 *b, __m64 *result, unsigned long count )
{
  __m64 sum, _a, _b;
  unsigned int i;

  _a = a[0];
  _b = b[0];

  sum = _mm_add_si64( _a, _b );
  for( i = 1; i < count; i++ )
  {
   result[i-1] = sum;
   _a = a[i];
   _b = b[i];
   sum = _mm_add_si64( _a, _b );
  }
  return sum;
}

Compiling this test case with -O1 (-fPIC -march=pentium4 are defaults), we get:
L4:
        movl    20(%ebp), %edi
        movl    %eax, -8(%edi,%ecx,8)
        movl    %edx, -4(%edi,%ecx,8)
        leal    0(,%ecx,8), %eax
        movl    16(%ebp), %edx
        movq    (%edx,%eax), %mm0
        movq    %mm0, -32(%ebp)        <--------
        movq    -32(%ebp), %mm0         <---------
        movl    12(%ebp), %edx
        paddq   (%edx,%eax), %mm0
        movq    %mm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    -36(%ebp), %edx
        addl    $1, %ecx
        cmpl    %ecx, %esi
        jne     L4
! Note the redundant store and load.

While FSF gcc-4.0 produces the following loop (no redundancy):

L4:
        movl    20(%ebp), %edi
        movl    %eax, -8(%edi,%ecx,8)
        movl    %edx, -4(%edi,%ecx,8)
        leal    0(,%ecx,8), %eax
        movl    12(%ebp), %edx
        movq    (%edx,%eax), %mm0
        movl    16(%ebp), %edi
        paddq   (%edi,%eax), %mm0
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %eax
        movl    -12(%ebp), %edx
        addl    $1, %ecx
        cmpl    %ecx, %esi
        jne     L4

The main culprit in the patch is this statement in regclass.c
          /* If no register class is better than memory, use memory. */
          if (p->mem_cost < best_cost)
            best = NO_REGS;

For the following RTL:

(insn 21 19 23 0 (set (reg/v:V2SI 63 [ sum ])
        (subreg:V2SI (reg:DI 69) 0)) 736 {*movv2si_internal} (insn_list:REG_DEP_TRUE 19 (nil))
    (expr_list:REG_DEAD (reg:DI 69)
        (nil)))


Cost computation says MMX_REGS is inferior to a store to memory. I spent some
time trying to find out why the cost of memory is less than the cost of
keeping the value in a register. This is the pattern that causes the register
cost to be higher (in mmx.md):

(define_insn "*mov<mode>_internal"
  [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
                        "=*y,*y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
        (match_operand:MMXMODEI 1 "vector_move_operand"
                        "C  ,*ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]

Note that in this pattern the cost computations for MMX_REGS are all ignored
(the '*' in front of 'y').  So the cost that is computed is for 'r', i.e.
GENERAL_REGS.  This cost is too high and eventually causes the memory cost to
come out lower than the register cost.  I tried the following simple patch as
an experiment and got all the performance back (it is now comparable with
4.0).  In this patch I removed the '*' in the 2nd alternative, so the cost of
keeping the operand in the MMX_REGS class is factored in; this results in a
lower cost than that of memory.  Is this the way to go?  It is just an
experiment that seems to work, but I am wondering about its implications, as
most of the other alternatives also have their costs ignored in this and
similar patterns:

Here is the patch:

RCS file: /cvs/gcc/gcc/gcc/config/i386/mmx.md,v
retrieving revision 1.6.14.3
diff -c -p -r1.6.14.3 mmx.md
*** config/i386/mmx.md  9 Jul 2005 22:06:34 -0000       1.6.14.3
--- config/i386/mmx.md  20 Sep 2005 21:23:52 -0000
***************
*** 87,95 ****

  (define_insn "*mov<mode>_internal"
    [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
!                       "=*y,*y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
        (match_operand:MMXMODEI 1 "vector_move_operand"
!                       "C  ,*ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]
    "TARGET_MMX
     && (GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)"
    "@
--- 87,95 ----

  (define_insn "*mov<mode>_internal"
    [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
!                       "=*y,y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
        (match_operand:MMXMODEI 1 "vector_move_operand"
!                       "C  ,ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]
    "TARGET_MMX
     && (GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)"
    "@
Comment 21 paolo.bonzini@lu.unisi.ch 2005-09-21 06:51:55 UTC
Subject: Re:  x87 reg allocated for constants for -mfpmath=sse


>Note that in this pattern cost computation of MMX_REGS  are all ignored ('*' in front of y). So, the cost 
>which is computed is for 'r' which is GENERAL_REGS. This cost is too high and eventually results in 
>memory cost to be lower than register cost. I tried the following simple patch as experiment and got all 
>the performance back (it is now comparable with 4.0). Note that in this patch, I removed the '*' in the 
>2nd alternative so cost of keeping the operand in mmx_regs class is factored in. This resulted in a 
>lower cost than that of memory. Is this the way to go? This is just an experiment which seems to work. 
>  
>
I think it makes sense.  The x86 back-end is playing too many tricks 
(such as the # classes) with the register allocator and regclass 
especially, and they are biting back.

Still, I'd rather hear from an expert as to why the classes were written 
like this.

Paolo
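The mis-costing discussed above can be sketched with a toy model (all the
cost numbers below are invented for illustration, not GCC's actual register
and memory costs), showing how hiding the MMX alternative behind '*' flips
the regclass decision toward memory:

```shell
# Invented example costs for allocating reg 63 (V2SI):
mmx_cost=2        # MMX_REGS alternative; hidden from regclass by '*y'
general_cost=10   # GENERAL_REGS alternative; still visible via '?r'
mem_cost=6        # p->mem_cost, the cost of spilling to the stack

# With the '*' hints, the cheapest visible register class is GENERAL_REGS,
# and memory looks better, so prefclass becomes NO_REGS:
best=$general_cost
if [ "$mem_cost" -lt "$best" ]; then choice_star="NO_REGS"; else choice_star="GENERAL_REGS"; fi

# With the '*' removed from the 2nd alternative, MMX_REGS is considered
# and beats memory:
best=$mmx_cost
if [ "$mem_cost" -lt "$best" ]; then choice_nostar="NO_REGS"; else choice_nostar="MMX_REGS"; fi

echo "with '*': $choice_star, without '*': $choice_nostar"
```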
Comment 22 Dale Johannesen 2005-09-21 17:23:38 UTC
I agree with Paolo.
Comment 23 Paolo Bonzini 2005-11-22 09:21:54 UTC
Dale, can you please take care of merging this into 4.2?
Comment 24 Paolo Bonzini 2006-04-03 11:20:11 UTC
Subject: Bug 19653

Author: bonzini
Date: Mon Apr  3 11:20:07 2006
New Revision: 112637

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=112637
Log:
2005-08-08  Paolo Bonzini  <bonzini@gnu.org>
	    Dale Johannesen  <dalej@apple.com>

	PR target/19653
	* regclass.c (struct reg_pref): Update documentation.
	(regclass): Set prefclass to NO_REGS if memory is the best option.
	(record_reg_classes): Cope with a prefclass set to NO_REGS.
	* reload.c (find_reloads): Take PREFERRED_OUTPUT_RELOAD_CLASS
	into account.  For non-registers, equate an empty preferred
	reload class to a `!' in the constraint; move the if clause to
	do so after those that reject the insn.
	(push_reload): Allow PREFERRED_*_RELOAD_CLASS to liberally
	return NO_REGS.
	(find_dummy_reload): Likewise.
	* doc/tm.texi (Register Classes): Document what it means
	if PREFERRED_*_RELOAD_CLASS return NO_REGS.
	* config/i386/i386.c (ix86_preferred_reload_class): Force
	using SSE registers (and return NO_REGS for floating-point
	constants) if math is done with SSE.
	(ix86_preferred_output_reload_class): New.
	* config/i386/i386-protos.h (ix86_preferred_output_reload_class): New.
	* config/i386/i386.h (PREFERRED_OUTPUT_RELOAD_CLASS): New.
	* config/i386/i386.md: Remove # register preferences.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-protos.h
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/doc/tm.texi
    trunk/gcc/regclass.c
    trunk/gcc/reload.c

Comment 25 Paolo Bonzini 2006-04-03 11:20:39 UTC
Fixed on mainline.
Comment 26 Paolo Bonzini 2006-04-18 08:23:43 UTC
Subject: Bug 19653

Author: bonzini
Date: Tue Apr 18 08:23:39 2006
New Revision: 113026

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113026
Log:
2006-04-18  Paolo Bonzini  <bonzini@gnu.org>

        PR target/27117

	Partial revert of revision 112637
	2006-04-03  Paolo Bonzini  <bonzini@gnu.org>
		    Dale Johannesen  <dalej@apple.com>

	PR target/19653
	* regclass.c (struct reg_pref): Update documentation.
	(regclass): Set prefclass to NO_REGS if memory is the best option.
	(record_reg_classes): Cope with a prefclass set to NO_REGS.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/regclass.c