This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



SSE support in the mainline gcc


Hi,
Thanks to the extremely quick response of Richard Henderson, I've merged into the
mainline CVS most of the code necessary to make gcc generate floating point code
using the SSE(2) instruction set. The code does not work in mainline gcc yet,
because of problems with the SSE builtins and stack alignment. Since I need
to concentrate on x86-64 issues this week, the current situation will probably
remain until next week, so I am sending a short message for those interested
in trying the feature.

The attached patch will help you get around the problems in gcc by disabling the
SSE builtins and using movups instead of movaps for TImode moves.  This is
necessary because the merge of the SSE builtins code is incomplete.  You also need
the latest binutils snapshot for SSE2 instruction set support (you may use the
binutils-2_11 branch), and you may want to install the patch for SSE based FP
comparisons to keep gcc from going crazy at each FP comparison:

http://gcc.gnu.org/ml/gcc-patches/2001-02/msg00719.html
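
For readers wondering why the movaps -> movups change matters: movaps faults
when its memory operand is not on a 16-byte boundary, while movups accepts any
address. As long as stack alignment is not guaranteed, only the unaligned form
is safe. A minimal C sketch of that condition follows; the helper names
is_aligned16 and sse_move_for are made up for illustration and are not part of
gcc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when p sits on a 16-byte boundary,
   which is what movaps requires of a memory operand.  */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: names the 128-bit move that is safe for p.
   movups works for any address; movaps only for aligned ones.  */
const char *sse_move_for(const void *p)
{
    return is_aligned16(p) ? "movaps" : "movups";
}
```

The patch simply takes the conservative branch everywhere, which is correct
but costs speed on aligned data.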

Then gcc should work and produce quite sane code, modulo the following problems:

1) caller save code needs to be taught not to use the widest reg mode, but the
   mode the register is really used in.  It hurts badly to save each SSE
   register as a 128bit unaligned value. (it also hurts on i387 to always save
   XFmode, and in the integer unit it causes partial register stalls)
2) regclass needs to be taught to propagate register class information
   (see my propagation patches from about two years ago)
3) abs/neg/conditional move patterns need to be implemented - currently
   gcc just moves the value to i387 and back - ouch! Conditional moves
   need to be done using logical operations on the FP value, which I don't
   know how to represent clearly in gcc yet.
4) scheduling and tuning are needed of course :)
5) sometimes the reg-stack pass seems to even help, because it undoes some of
   the reload braindamage due to its non-global nature.
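
To make item 3 concrete, here is a rough C-level sketch of doing abs, neg and
a conditional move with logical operations on the bits of the FP value. It
only illustrates the idiom (an all-ones/all-zeros mask combined with
and/andnot/or, which is what an SSE compare followed by andps/andnps/orps would
do); it says nothing about how the patterns should be represented in gcc. The
function names are made up, and the code assumes float is a 32-bit IEEE
single.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as an integer and back
   (assumes 32-bit IEEE single precision).  */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* abs: clear the sign bit with a logical AND.  */
float fp_abs(float x) { return u2f(f2u(x) & 0x7fffffffu); }

/* neg: flip the sign bit with a logical XOR.  */
float fp_neg(float x) { return u2f(f2u(x) ^ 0x80000000u); }

/* Conditional move: returns x when a < b, otherwise y, without a branch.
   The mask is all ones when the comparison holds, all zeros otherwise.  */
float fp_select_lt(float a, float b, float x, float y)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);
    return u2f((f2u(x) & mask) | (f2u(y) & ~mask));
}
```

The point of the mask form is that everything stays in the SSE unit, so no
trip through the i387 is needed.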

If you are interested in helping with these, please let me know first - I
still have some code in my SSE branch to merge.

I've also run a few benchmarks on a P4.  In simple inner loops, the code
seems to rock:

(comparison of -msse2 to -mno-sse2 in long double, double and float arithmetic)

Floating point tests                                      long double  double  float
Mandelbrot set calculation loop (tests/mset.c)                   100%    133%   129%
Floating point and constants mix (tests/fpconmix.c)               96%    133%   120%
Floating point and integer mix (tests/fpintmix.c)                100%    198%   223%
Unrolled mandelbrot set calculation loop (tests/umset.c)         101%    118%   120%
Quicksort (tests/qsort.c)                                         91%    122%   122%

END
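
For anyone who wants to repeat this kind of comparison without the test
suite, a kernel along the following lines can be built once with -msse2 and
once with -mno-sse2 and timed for each type. This is only a hypothetical
stand-in, not the contents of tests/mset.c; the macro and function names are
made up for illustration.

```c
/* Hypothetical stand-in for a Mandelbrot inner loop, instantiated for the
   three FP types compared above.  Returns the number of iterations run
   before |z|^2 exceeds 4, capped at max_iter.  */
#define DEFINE_MSET(NAME, T)                                    \
    int NAME(T cr, T ci, int max_iter)                          \
    {                                                           \
        T zr = 0, zi = 0;                                       \
        int n = 0;                                              \
        while (n < max_iter && zr * zr + zi * zi <= (T)4) {     \
            T t = zr * zr - zi * zi + cr;  /* Re(z*z + c) */    \
            zi = (T)2 * zr * zi + ci;      /* Im(z*z + c) */    \
            zr = t;                                             \
            n++;                                                \
        }                                                       \
        return n;                                               \
    }

DEFINE_MSET(mset_float, float)
DEFINE_MSET(mset_double, double)
DEFINE_MSET(mset_long_double, long double)
```

The long double variant stays on the i387 either way, which fits the roughly
unchanged first column above.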

In the byte benchmark I get a slowdown from 8.34 to 7.58 - due to a mixture of
the reasons mentioned above.

If you try this, let me know how it works!

Index: i386.c
===================================================================
RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.c,v
retrieving revision 1.216
diff -c -3 -p -r1.216 i386.c
*** i386.c	2001/02/13 13:54:44	1.216
--- i386.c	2001/02/13 17:45:39
*************** order_regs_for_local_alloc ()
*** 712,717 ****
--- 712,724 ----
      {
        for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
  	reg_alloc_order[i] = i;
+       /* Arrange SSE regs before STACK ones, since SSE arithmetics is usually
+ 	 faster when available.  */
+       for (i = 0; i < 8; i++)
+ 	{
+ 	  reg_alloc_order[i + FIRST_STACK_REG] = i + FIRST_SSE_REG;
+ 	  reg_alloc_order[i + FIRST_SSE_REG] = i + FIRST_STACK_REG;
+ 	}
      }
  }
  
*************** static struct builtin_description bdesc_
*** 7649,7654 ****
--- 7672,7678 ----
  void
  ix86_init_builtins ()
  {
+ #if 0
    struct builtin_description * d;
    size_t i;
    tree endlink = void_list_node;
*************** ix86_init_builtins ()
*** 8014,8019 ****
--- 8038,8044 ----
    def_builtin ("__builtin_ia32_loadrps", v4sf_ftype_pfloat, IX86_BUILTIN_LOADRPS);
    def_builtin ("__builtin_ia32_storeps1", void_ftype_pfloat_v4sf, IX86_BUILTIN_STOREPS1);
    def_builtin ("__builtin_ia32_storerps", void_ftype_pfloat_v4sf, IX86_BUILTIN_STORERPS);
+ #endif
  }
  
  /* Errors in the source file can cause expand_expr to return const0_rtx
Index: i386.md
===================================================================
RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.md,v
retrieving revision 1.207
diff -c -3 -p -r1.207 i386.md
*** i386.md	2001/02/13 13:54:44	1.207
--- i386.md	2001/02/13 17:45:48
***************
*** 13125,13132 ****
  	(match_operand:TI 1 "general_operand" "xm,x"))]
    "TARGET_SSE"
    "@
!    movaps\\t{%1, %0|%0, %1}
!    movaps\\t{%1, %0|%0, %1}"
    [(set_attr "type" "sse")])
  
  ;; These two patterns are useful for specifying exactly whether to use
--- 13285,13292 ----
  	(match_operand:TI 1 "general_operand" "xm,x"))]
    "TARGET_SSE"
    "@
!    movups\\t{%1, %0|%0, %1}
!    movups\\t{%1, %0|%0, %1}"
    [(set_attr "type" "sse")])
  
  ;; These two patterns are useful for specifying exactly whether to use

