


Committed: CRIS: new multilib for v8, libgcc improvements and move to soft-fp.


There's a page-full or two of numbers reported here together with
the background, but maintainers of software-floating-point ports
used on microcontrollers may find them useful if they're on a
cycle or size budget and are weighing fp-bit against soft-fp.

For an on-chip controller subsystem with a CRIS CPU, there was a
control loop involving floating-point numbers whose range did not
lend itself to easy conversion to our fixed-point library (a
shameless plug seems in order; it's LGPL:
<http://savannah.nongnu.org/projects/fixmath/>); i.e. there was
no suitable "fixed point" position within 32 bits.  Worse, this
particular CRIS CPU does not have fast multiplication; it has the
"CRIS v8" ISA.
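
To make the "no suitable fixed-point position" statement concrete,
here's a minimal sketch with made-up numbers (the real loop's value
range isn't reproduced here): a signed 32-bit Qm.f value covers
magnitudes up to 2^(31-f) at a resolution of 2^-f, so a loop that
needs both tiny and huge values can rule out every f mechanically:

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* Hypothetical requirements; substitute your own loop's extremes.  */
  double smallest = 1e-6, largest = 1e5;

  for (int f = 0; f <= 31; f++)
    if (ldexp (1.0, -f) <= smallest && ldexp (1.0, 31 - f) > largest)
      printf ("Q%d.%d fits\n", 31 - f, f);

  /* Prints nothing: resolution needs f >= 20, magnitude needs f <= 14.  */
  return 0;
}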

The golden model for the control loop used "double", but a raw
cycle count using that type showed 117836 cycles (the average over
a data set of 1000 iterations, using the CRIS simulator
co-habiting with the gdb project).  The budget was 25000.
Thankfully, 24 bits of precision proved sufficient.  Switching to
"float" made the number of cycles shrink to 46974.  Still a long
way to go.  Switching to a separate multilib made sense, as this
CRIS version has a leading-zeros-count insn which did not exist
in the base version, and libgcc makes use of that, both for the
older fp-bit.c and for the newer soft-fp floating-point
libraries.  CRIS used the older fp-bit library only because I had
never found a reason to try the newer soft-fp before, but I
recalled promises of big performance improvements.  The multilib
did help some, but not much; it just took the number down to
45741 cycles (some 2%).  A special umulsidi3 function and an
improved longlong.h helped more, down to 43266 cycles; about 5%.
Then I tried soft-fp and...  Bam!  The number of cycles went
down to 23318, a 46% improvement.  Arguably, there was a
downside: the size of the program went up, from 10901 to 13741
bytes.  More tweaks included in the patch (arit.c, mulsi3.S)
showed small improvements, down to 23236 cycles.
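
As an aside for anyone repeating the type-switch experiment: keep
the working type behind a typedef, and mind the constant suffixes,
or double-precision promotion quietly sneaks back in.  A hedged
sketch (real_t and step are hypothetical, not the actual control
loop):

typedef float real_t;	/* was "double" in the golden model */

real_t
step (real_t state, real_t input)
{
  /* Without the cast (or an "f" suffix), 0.125 is a double
     constant and the whole expression is computed in double,
     reinstating the slow path.  */
  const real_t gain = (real_t) 0.125;
  return state + gain * (input - state);	/* first-order low-pass */
}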

If you're in this situation and using newlib, you'll also find
that using __ieee754_sqrtf instead of sqrtf helps, as does
-ffast-math if your formulas allow it.  Those changes, though,
only took the number of cycles down to 22951 (1%).  Use of the
internal sqrt function makes more sense from a size perspective,
as the wrapper function pulls in conversion support to and from
double (obvious from the code).  The difference is 13713 versus
11035 bytes (static data and program), which matters more when
you know that the size of the associated internal RAM is 16 KiB
and that those previous numbers exclude startup code, stack and
heap.
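
For reference, a minimal sketch of the sqrtf change, assuming
newlib (the declaration is written out by hand because
__ieee754_sqrtf lives in newlib's internal headers; magnitude is a
made-up example function):

/* newlib's internal sqrtf kernel; not declared in <math.h>.  */
extern float __ieee754_sqrtf (float);

float
magnitude (float x, float y)
{
  /* Calling the kernel directly skips the public wrapper and its
     error handling, which is what pulls in float<->double
     conversion support.  */
  return __ieee754_sqrtf (x * x + y * y);
}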

Checking afterwards whether the separate multilib still made
sense: without it, the number of cycles would be 25158 (with all
other improvements in), so yes.  Apparently soft-fp makes more
use of the leading-zeros-count insn than fp-bit does.
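
That's plausible given what the insn is used for: soft-fp
normalizes fractions via count_leading_zeros from longlong.h,
which the patch maps to __builtin_clz (a single lz insn from v8
on).  A sketch of such a normalization step (illustrative only,
not soft-fp's actual code):

#define count_leading_zeros(count, x) ((count) = __builtin_clz (x))
#define COUNT_LEADING_ZEROS_0 32	/* what the lz insn gives for 0 */

/* Shift a nonzero fraction up until its top bit is set, reporting
   the shift amount, as a float renormalization has to do.  */
static unsigned int
normalize (unsigned int frac, int *shift)
{
  int c;
  count_leading_zeros (c, frac);	/* frac must be nonzero */
  *shift = c;
  return frac << c;
}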

Some belated bottom-line advice: port maintainers who have not
already done so, switch from fp-bit to soft-fp if floating-point
performance may ever matter to your port.  If you're worried
it'll bloat the code, that's your decision, but you can actually
get something like twice the speed *at the application level*
for a 1/4 size increase.  If you're worried the library doesn't
have the knobs you want for your port, it does have them; a lot
more than fp-bit.  Have a look: there are several different core
code choices for each of multiplication, division and addition.
And as seen in the patch, I didn't even have to turn those knobs.
(Also, the goal for the target is met and I'm out of time.
Later.)
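
(For the curious, the knobs live in sfp-machine.h; see the new
file in the patch below.  The alternative strategy names here are
quoted from soft-fp's op-1.h as I remember them, so double-check
the headers before relying on them:)

/* What the patch picks for single-precision multiplication: */
#define _FP_MUL_MEAT_S(R,X,Y)				\
  _FP_MUL_MEAT_1_wide(_FP_WFRACBITS_S,R,X,Y,umul_ppmm)

/* One-line alternatives include _FP_MUL_MEAT_1_imm (pure
   shift-and-add, no umul_ppmm required) and _FP_MUL_MEAT_1_hard
   (one full hardware-width multiply), with a matching spread for
   division (_FP_DIV_MEAT_1_udiv, _FP_DIV_MEAT_1_loop, ...).  */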

Tested to not regress for cris-elf (all multilibs except v32)
and crisv32-elf at r203561.  Also built crisv32-linux-gnu as a
sanity check, but that was against a local import done long ago,
in the 4.7 era.

gcc:
	* config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES)
	(MULTILIB_MATCHES): Add multilib for -march=v8.

libgcc:
	For CRIS ports, switch to soft-fp.  Improve arit.c and longlong.h.
	* config.host (cpu_type) <Setting default>: Add entry for
	crisv32-*-*.
	(tmake_file) <crisv32-*-elf, cris-*-elf, cris-*-linux*>
	<crisv32-*-linux*>: Adjust.
	* longlong.h: Wrap the whole CRIS section in a single
	defined(__CRIS__) conditional.  Add comment about add_ssaaaa
	and sub_ddmmss.
	(COUNT_LEADING_ZEROS_0): Define when count_leading_zeros is
	defined.
	[__CRIS__] (__umulsidi3): Define.
	[__CRIS__] (umul_ppmm): Define in terms of __umulsidi3.
	* config/cris/sfp-machine.h: New file.
	* config/cris/umulsidi3.S: New file.
	* config/cris/t-elfmulti (LIB2ADD_ST): Add umulsidi3.S.
	* config/cris/arit.c (SIGNMULT): New macro.
	(__Div, __Mod): Use SIGNMULT instead of naked multiplication.
	* config/cris/mulsi3.S: Tweak to avoid redundant register-copying;
	saving 3 out of originally 33 cycles from the fastest
	path, 3 out of 54 from the medium path and one from the longest
	path.  Improve comments.

diff --git a/gcc/config/cris/t-elfmulti b/gcc/config/cris/t-elfmulti
index 29ed57d..8bdbc55 100644
--- a/gcc/config/cris/t-elfmulti
+++ b/gcc/config/cris/t-elfmulti
@@ -16,9 +16,10 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-MULTILIB_OPTIONS = march=v10/march=v32
-MULTILIB_DIRNAMES = v10 v32
+MULTILIB_OPTIONS = march=v8/march=v10/march=v32
+MULTILIB_DIRNAMES = v8 v10 v32
 MULTILIB_MATCHES = \
+		march?v8=mcpu?v8 \
 		march?v10=mcpu?etrax100lx \
 		march?v10=mcpu?ng \
 		march?v10=march?etrax100lx \
diff --git a/libgcc/config.host b/libgcc/config.host
index c853a12..841167b 100644
--- a/libgcc/config.host
+++ b/libgcc/config.host
@@ -100,6 +100,9 @@ bfin*-*)
 	;;
 cr16-*-*)
 	;;
+crisv32-*-*)
+	cpu_type=cris
+	;;
 fido-*-*)
 	cpu_type=m68k
 	;;
@@ -422,13 +425,13 @@ cr16-*-elf)
 	extra_parts="$extra_parts crti.o crtn.o crtlibid.o"
         ;;
 crisv32-*-elf)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp"
  	;;
 cris-*-elf)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit cris/t-elfmulti"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp cris/t-elfmulti"
 	;;
 cris-*-linux* | crisv32-*-linux*)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit cris/t-linux"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp cris/t-linux"
 	;;
 epiphany-*-elf*)
 	tmake_file="epiphany/t-epiphany t-fdpbit epiphany/t-custom-eqsf"
diff --git a/libgcc/config/cris/arit.c b/libgcc/config/cris/arit.c
index 32255f9..21bec66 100644
--- a/libgcc/config/cris/arit.c
+++ b/libgcc/config/cris/arit.c
@@ -39,6 +39,14 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define LZ(v) __builtin_clz (v)
 #endif
 
+/* In (at least) the 4.7 series, GCC doesn't automatically choose the
+   most optimal strategy, possibly related to insufficient modelling of
+   delay-slot costs.  */
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 10
+#define SIGNMULT(s, a) ((s) * (a)) /* Cheap multiplication, better than branch.  */
+#else
+#define SIGNMULT(s, a) ((s) < 0 ? -(a) : (a)) /* Branches are still better.  */
+#endif
 
 #if defined (L_udivsi3) || defined (L_divsi3) || defined (L_umodsi3) \
     || defined (L_modsi3)
@@ -199,6 +207,7 @@ __Div (long a, long b)
 {
   long extra = 0;
   long sign = (b < 0) ? -1 : 1;
+  long res;
 
   /* We need to handle a == -2147483648 as expected and must while
      doing that avoid producing a sequence like "abs (a) < 0" as GCC
@@ -214,15 +223,14 @@ __Div (long a, long b)
       if ((a & 0x7fffffff) == 0)
 	{
 	  /* We're at 0x80000000.  Tread carefully.  */
-	  a -= b * sign;
+	  a -= SIGNMULT (sign, b);
 	  extra = sign;
 	}
       a = -a;
     }
 
-  /* We knowingly penalize pre-v10 models by multiplication with the
-     sign.  */
-  return sign * do_31div (a, __builtin_labs (b)).quot + extra;
+  res = do_31div (a, __builtin_labs (b)).quot;
+  return SIGNMULT (sign, res) + extra;
 }
 #endif /* L_divsi3 */
 
@@ -274,6 +282,7 @@ long
 __Mod (long a, long b)
 {
   long sign = 1;
+  long res;
 
   /* We need to handle a == -2147483648 as expected and must while
      doing that avoid producing a sequence like "abs (a) < 0" as GCC
@@ -291,7 +300,8 @@ __Mod (long a, long b)
       a = -a;
     }
 
-  return sign * do_31div (a, __builtin_labs (b)).rem;
+  res = do_31div (a, __builtin_labs (b)).rem;
+  return SIGNMULT (sign, res);
 }
 #endif /* L_modsi3 */
 #endif /* L_udivsi3 || L_divsi3 || L_umodsi3 || L_modsi3 */
diff --git a/libgcc/config/cris/mulsi3.S b/libgcc/config/cris/mulsi3.S
index 76dfb63..213ed90 100644
--- a/libgcc/config/cris/mulsi3.S
+++ b/libgcc/config/cris/mulsi3.S
@@ -113,16 +113,22 @@ ___Mul:
 	ret
 	nop
 #else
-	move.d $r10,$r12
+;; See if we can avoid multiplying some of the parts, knowing
+;; they're zero.
+
 	move.d $r11,$r9
-	bound.d $r12,$r9
+	bound.d $r10,$r9
 	cmpu.w 65535,$r9
 	bls L(L3)
-	move.d $r12,$r13
+	move.d $r10,$r12
 
-	movu.w $r11,$r9
+;; Nope, have to do all the parts of a 32-bit multiplication.
+;; See head comment in optabs.c:expand_doubleword_mult.
+
+	move.d $r10,$r13
+	movu.w $r11,$r9 ; ab*cd = (a*d + b*c)<<16 + b*d
 	lslq 16,$r13
-	mstep $r9,$r13
+	mstep $r9,$r13	; d*b
 	mstep $r9,$r13
 	mstep $r9,$r13
 	mstep $r9,$r13
@@ -140,7 +146,7 @@ ___Mul:
 	mstep $r9,$r13
 	clear.w $r10
 	test.d $r10
-	mstep $r9,$r10
+	mstep $r9,$r10	; d*a
 	mstep $r9,$r10
 	mstep $r9,$r10
 	mstep $r9,$r10
@@ -157,10 +163,9 @@ ___Mul:
 	mstep $r9,$r10
 	mstep $r9,$r10
 	movu.w $r12,$r12
-	move.d $r11,$r9
-	clear.w $r9
-	test.d $r9
-	mstep $r12,$r9
+	clear.w $r11
+	move.d $r11,$r9 ; Doubles as a "test.d" preparing for the mstep.
+	mstep $r12,$r9	; b*c
 	mstep $r12,$r9
 	mstep $r12,$r9
 	mstep $r12,$r9
@@ -182,17 +187,24 @@ ___Mul:
 	add.d $r13,$r10
 
 L(L3):
-	move.d $r9,$r10
+;; Form the maximum in $r10, by knowing the minimum, $r9.
+;; (We don't know which one of $r10 or $r11 it is.)
+;; Check if the largest operand is still just 16 bits.
+
+	xor $r9,$r10
 	xor $r11,$r10
-	xor $r12,$r10
 	cmpu.w 65535,$r10
 	bls L(L5)
 	movu.w $r9,$r13
 
-	movu.w $r13,$r13
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but c==0
+;; so we only need (a*d)<<16 + b*d with d = $r13, ab = $r10.
+;; We drop the upper part of (a*d)<<16 as we're only doing a
+;; 32-bit-result multiplication.
+
 	move.d $r10,$r9
 	lslq 16,$r9
-	mstep $r13,$r9
+	mstep $r13,$r9	; b*d
 	mstep $r13,$r9
 	mstep $r13,$r9
 	mstep $r13,$r9
@@ -210,7 +222,7 @@ L(L3):
 	mstep $r13,$r9
 	clear.w $r10
 	test.d $r10
-	mstep $r13,$r10
+	mstep $r13,$r10	; a*d
 	mstep $r13,$r10
 	mstep $r13,$r10
 	mstep $r13,$r10
@@ -231,25 +243,27 @@ L(L3):
 	add.d $r9,$r10
 
 L(L5):
-	movu.w $r9,$r9
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but a and c==0
+;; so b*d (with b=$r13, a=$r10) it is.
+
 	lslq 16,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
 	ret
-	mstep $r9,$r10
+	mstep $r13,$r10
 #endif
 L(Lfe1):
 	.size	___Mul,L(Lfe1)-___Mul
diff --git a/libgcc/config/cris/sfp-machine.h b/libgcc/config/cris/sfp-machine.h
new file mode 100644
index 0000000..0d52a70
--- /dev/null
+++ b/libgcc/config/cris/sfp-machine.h
@@ -0,0 +1,78 @@
+/* Soft-FP definitions for CRIS.
+   Copyright (C) 2013 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+#define _FP_W_TYPE_SIZE		32
+#define _FP_W_TYPE		unsigned long
+#define _FP_WS_TYPE		signed long
+#define _FP_I_TYPE		long
+
+/* The type of the result of a floating point comparison.  This must
+   match `__libgcc_cmp_return__' in GCC for the target.  */
+typedef int __gcc_CMPtype __attribute__ ((mode (__libgcc_cmp_return__)));
+#define CMPtype __gcc_CMPtype
+
+/* FIXME: none of the *MEAT* macros have actually been benchmarked to be
+   better than any other choice for any CRIS variant.  */
+
+#define _FP_MUL_MEAT_S(R,X,Y)				\
+  _FP_MUL_MEAT_1_wide(_FP_WFRACBITS_S,R,X,Y,umul_ppmm)
+#define _FP_MUL_MEAT_D(R,X,Y)				\
+  _FP_MUL_MEAT_2_wide(_FP_WFRACBITS_D,R,X,Y,umul_ppmm)
+
+#define _FP_DIV_MEAT_S(R,X,Y)	_FP_DIV_MEAT_1_loop(S,R,X,Y)
+#define _FP_DIV_MEAT_D(R,X,Y)	_FP_DIV_MEAT_2_udiv(D,R,X,Y)
+
+#define _FP_NANFRAC_S		((_FP_QNANBIT_S << 1) - 1)
+#define _FP_NANFRAC_D		((_FP_QNANBIT_D << 1) - 1), -1
+#define _FP_NANSIGN_S		0
+#define _FP_NANSIGN_D		0
+#define _FP_QNANNEGATEDP 0
+#define _FP_KEEPNANFRACP 1
+
+/* Someone please check this.  */
+#define _FP_CHOOSENAN(fs, wc, R, X, Y, OP)			\
+  do {								\
+    if ((_FP_FRAC_HIGH_RAW_##fs(X) & _FP_QNANBIT_##fs)		\
+	&& !(_FP_FRAC_HIGH_RAW_##fs(Y) & _FP_QNANBIT_##fs))	\
+      {								\
+	R##_s = Y##_s;						\
+	_FP_FRAC_COPY_##wc(R,Y);				\
+      }								\
+    else							\
+      {								\
+	R##_s = X##_s;						\
+	_FP_FRAC_COPY_##wc(R,X);				\
+      }								\
+    R##_c = FP_CLS_NAN;						\
+  } while (0)
+
+#define	__LITTLE_ENDIAN	1234
+#define	__BIG_ENDIAN	4321
+
+# define __BYTE_ORDER __LITTLE_ENDIAN
+
+/* Define ALIASNAME as a strong alias for NAME.  */
+# define strong_alias(name, aliasname) _strong_alias(name, aliasname)
+# define _strong_alias(name, aliasname) \
+  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
diff --git a/libgcc/config/cris/t-elfmulti b/libgcc/config/cris/t-elfmulti
index b180521..308ef51 100644
--- a/libgcc/config/cris/t-elfmulti
+++ b/libgcc/config/cris/t-elfmulti
@@ -1,3 +1,3 @@
-LIB2ADD_ST = $(srcdir)/config/cris/mulsi3.S
+LIB2ADD_ST = $(srcdir)/config/cris/mulsi3.S $(srcdir)/config/cris/umulsidi3.S
 
 CRTSTUFF_T_CFLAGS = -moverride-best-lib-options
diff --git a/libgcc/config/cris/umulsidi3.S b/libgcc/config/cris/umulsidi3.S
new file mode 100644
index 0000000..bf9858d
--- /dev/null
+++ b/libgcc/config/cris/umulsidi3.S
@@ -0,0 +1,289 @@
+;; Copyright (C) 2001, 2004, 2013 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify it under
+;; the terms of the GNU General Public License as published by the Free
+;; Software Foundation; either version 3, or (at your option) any later
+;; version.
+;;
+;; GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+;; WARRANTY; without even the implied warranty of MERCHANTABILITY or
+;; FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+;; for more details.
+;;
+;; Under Section 7 of GPL version 3, you are granted additional
+;; permissions described in the GCC Runtime Library Exception, version
+;; 3.1, as published by the Free Software Foundation.
+;;
+;; You should have received a copy of the GNU General Public License and
+;; a copy of the GCC Runtime Library Exception along with this program;
+;; see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+;; <http://www.gnu.org/licenses/>.
+;;
+;; This code is derived from mulsi3.S, observing that the mstep*16-based
+;; multiplications there, from which it is formed, are actually
+;; zero-extending; in gcc-speak "umulhisi3".  The difference to *this*
+;; function is just a missing top mstep*16 sequence and shifts and 64-bit
+;; additions for the high part.  Compared to an implementation based on
+;; calling __Mul four times (see default implementation of umul_ppmm in
+;; longlong.h), this will complete in a time between a fourth and a third
+;; of that, assuming the value-based optimizations don't strike.  If they
+;; all strike there (very often) but none here, we still win, though by a
+;; lesser margin, due to lesser total overhead.
+
+#define L(x) .x
+#define CONCAT1(a, b) CONCAT2(a, b)
+#define CONCAT2(a, b) a ## b
+
+#ifdef __USER_LABEL_PREFIX__
+# define SYM(x) CONCAT1 (__USER_LABEL_PREFIX__, x)
+#else
+# define SYM(x) x
+#endif
+
+	.global SYM(__umulsidi3)
+	.type	SYM(__umulsidi3),@function
+SYM(__umulsidi3):
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 10
+;; Can't have the mulu.d last on a cache-line, due to a hardware bug.  See
+;; the documentation for -mmul-bug-workaround.
+;; Not worthwhile to conditionalize here.
+	.p2alignw 2,0x050f
+	mulu.d $r11,$r10
+	ret
+	move $mof,$r11
+#else
+	move.d $r11,$r9
+	bound.d $r10,$r9
+	cmpu.w 65535,$r9
+	bls L(L3)
+	move.d $r10,$r12
+
+	move.d $r10,$r13
+	movu.w $r11,$r9 ; ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d
+
+;; We're called for floating point numbers very often with the "low" 16
+;; bits zero, so it's worthwhile to optimize for that.
+
+	beq L(L6)	; d == 0?
+	lslq 16,$r13
+
+	beq L(L7)	; b == 0?
+	clear.w $r10
+
+	mstep $r9,$r13	; d*b
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+
+L(L7):
+	test.d $r10
+	mstep $r9,$r10	; d*a
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+
+;; d*a in $r10, d*b in $r13, ab in $r12 and cd in $r11
+;; $r9 = d, need to do b*c and a*c; we can drop d.
+;; so $r9 is up for use and we can shift down $r11 as the mstep
+;; source for the next mstep-part.
+
+L(L8):
+	lsrq 16,$r11
+	move.d $r12,$r9
+	lslq 16,$r9
+	beq L(L9)	; b == 0?
+	mstep $r11,$r9
+
+	mstep $r11,$r9	; b*c
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+L(L9):
+
+;; d*a in $r10, d*b in $r13, c*b in $r9, ab in $r12 and c in $r11,
+;; need to do a*c.  We want that to end up in $r11, so we shift up $r11 to
+;; now use as the destination operand.  We'd need a test insn to update N
+;; to do it the other way round.
+
+	lsrq 16,$r12
+	lslq 16,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+
+;; d*a in $r10, d*b in $r13, c*b in $r9, a*c in $r11 ($r12 free).
+;; Need (a*d + b*c)<<16 + b*d into $r10 and
+;; a*c + (a*d + b*c)>>16 plus carry from the additions into $r11.
+
+	add.d $r9,$r10	; (a*d + b*c) - may produce a carry.
+	scs $r12	; The carry corresponds to bit 16 of $r11.
+	lslq 16,$r12
+	add.d $r12,$r11	; $r11 = a*c + carry from (a*d + b*c).
+
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 8
+	swapw $r10
+	addu.w $r10,$r11 ; $r11 = a*c + (a*d + b*c) >> 16 including carry.
+	clear.w $r10	; $r10 = (a*d + b*c) << 16
+#else
+	move.d $r10,$r9
+	lsrq 16,$r9
+	add.d $r9,$r11	; $r11 = a*c + (a*d + b*c) >> 16 including carry.
+	lslq 16,$r10	; $r10 = (a*d + b*c) << 16
+#endif
+	add.d $r13,$r10	; $r10 = (a*d + b*c) << 16 + b*d - may produce a carry.
+	scs $r9
+	ret
+	add.d $r9,$r11	; Last carry added to the high-order 32 bits.
+
+L(L6):
+	clear.d $r13
+	ba L(L8)
+	clear.d $r10
+
+L(L11):
+	clear.d $r10
+	ret
+	clear.d $r11
+
+L(L3):
+;; Form the maximum in $r10, by knowing the minimum, $r9.
+;; (We don't know which one of $r10 or $r11 it is.)
+;; Check if the largest operand is still just 16 bits.
+
+	xor $r9,$r10
+	xor $r11,$r10
+	cmpu.w 65535,$r10
+	bls L(L5)
+	movu.w $r9,$r13
+
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but c==0
+;; so we only need (a*d)<<16 + b*d with d = $r13, ab = $r10.
+;; Remember that the upper part of (a*d)<<16 goes into the lower part
+;; of $r11 and there may be a carry from adding the low 32 parts.
+	beq L(L11)	; d == 0?
+	move.d $r10,$r9
+
+	lslq 16,$r9
+	beq L(L10)	; b == 0?
+	clear.w $r10
+
+	mstep $r13,$r9	; b*d
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+L(L10):
+	test.d $r10
+	mstep $r13,$r10	; a*d
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	move.d $r10,$r11
+	lsrq 16,$r11
+	lslq 16,$r10
+	add.d $r9,$r10
+	scs $r12
+	ret
+	add.d $r12,$r11
+
+L(L5):
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but a and c==0
+;; so b*d (with min=b=$r13, max=d=$r10) it is.  As it won't overflow the
+;; 32-bit part, just set $r11 to 0.
+
+	lslq 16,$r10
+	clear.d $r11
+
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	ret
+	mstep $r13,$r10
+#endif
+L(Lfe1):
+	.size	SYM(__umulsidi3),L(Lfe1)-SYM(__umulsidi3)
diff --git a/libgcc/longlong.h b/libgcc/longlong.h
index 30cc2e3..24dbae4 100644
--- a/libgcc/longlong.h
+++ b/libgcc/longlong.h
@@ -272,12 +272,39 @@ UDItype __umulsidi3 (USItype, USItype);
 
 #endif /* defined (__AVR__) */
 
-#if defined (__CRIS__) && __CRIS_arch_version >= 3
+#if defined (__CRIS__)
+
+#if __CRIS_arch_version >= 3
 #define count_leading_zeros(COUNT, X) ((COUNT) = __builtin_clz (X))
+#define COUNT_LEADING_ZEROS_0 32
+#endif /* __CRIS_arch_version >= 3 */
+
 #if __CRIS_arch_version >= 8
 #define count_trailing_zeros(COUNT, X) ((COUNT) = __builtin_ctz (X))
-#endif
-#endif /* __CRIS__ */
+#endif /* __CRIS_arch_version >= 8 */
+
+#if __CRIS_arch_version >= 10
+#define __umulsidi3(u,v) ((UDItype)(USItype) (u) * (UDItype)(USItype) (v))
+#else
+#define __umulsidi3 __umulsidi3
+extern UDItype __umulsidi3 (USItype, USItype);
+#endif /* __CRIS_arch_version >= 10 */
+
+#define umul_ppmm(w1, w0, u, v)		\
+  do {					\
+    UDItype __x = __umulsidi3 (u, v);	\
+    (w0) = (USItype) (__x);		\
+    (w1) = (USItype) (__x >> 32);	\
+  } while (0)
+
+/* FIXME: defining add_ssaaaa and sub_ddmmss should be advantageous for
+   DFmode ("double" intrinsics, avoiding two of the three insns handling
+   carry), but defining them as open-code C composing and doing the
+   operation in DImode (UDImode) shows that the DImode needs work:
+   register pressure from requiring neighboring registers and the
+   traffic to and from them come to dominate, in the 4.7 series.  */
+
+#endif /* defined (__CRIS__) */
 
 #if defined (__hppa) && W_TYPE_SIZE == 32
 #define add_ssaaaa(sh, sl, ah, al, bh, bl) \

brgds, H-P

