This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Guard use of modulo in cshift (speedup protein)


Hi,

this patch speeds up Polyhedron's protein benchmark on Bulldozer quite a
bit.  The thing is that in this testcase cshift is called with a very
short length (<= 3) and the shift amount is always less than the length.
Surprisingly, the division instruction takes up a considerable amount of
time, so much that it makes sense to guard it and skip it when the shift
is already in bounds.
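
As a rough illustration of the guard described above, here is a minimal
standalone sketch.  The helper name normalize_shift is hypothetical and does
not exist in libgfortran; the real change is made inline in cshift0.m4, and
the sketch assumes both operands are ptrdiff_t.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libgfortran): reduce a circular
   shift amount to the range [0, len).  */
static ptrdiff_t
normalize_shift (ptrdiff_t shift, ptrdiff_t len)
{
  /* Fast path: a shift already in [0, len) needs no division.  */
  if (shift < 0 || shift >= len)
    {
      /* Slow path: C's % truncates toward zero, so the remainder of a
         negative shift is <= 0 and may need one correction by len.  */
      shift = len == 0 ? 0 : shift % len;
      if (shift < 0)
        shift += len;
    }
  return shift;
}
```

For in-range inputs the function returns its argument unchanged and never
reaches the division; for every other input it gives the same result as the
unguarded code.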

Here's an oprofile excerpt for _gfortrani_cshift0_i4 (total 31020 cycles):

    23  0.0032 :   caf00:       idiv   %r13
 13863  1.9055 :   caf03:       lea    (%rdx,%r13,1),%r12

I.e. despite the memory shuffling, one third of the cshift cycles are spent
in that division (the samples are charged to the instruction right after
the idiv).  With the patch the time for protein drops from 0m21.367s to
0m20.547s on this Bulldozer machine.  I've checked that it has no adverse
effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed,
still an improvement).

Regstrapped on x86_64-linux.  Okay for trunk?


Ciao,
Michael.

	* m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.

	* generated/cshift0_c10.c: Regenerated.
	* generated/cshift0_c16.c: Regenerated.
	* generated/cshift0_c4.c: Regenerated.
	* generated/cshift0_c8.c: Regenerated.
	* generated/cshift0_i16.c: Regenerated.
	* generated/cshift0_i1.c: Regenerated.
	* generated/cshift0_i2.c: Regenerated.
	* generated/cshift0_i4.c: Regenerated.
	* generated/cshift0_i8.c: Regenerated.
	* generated/cshift0_r10.c: Regenerated.
	* generated/cshift0_r16.c: Regenerated.
	* generated/cshift0_r4.c: Regenerated.
	* generated/cshift0_r8.c: Regenerated.

Index: m4/cshift0.m4
===================================================================
--- m4/cshift0.m4	(revision 186272)
+++ m4/cshift0.m4	(working copy)
@@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
   rptr = ret->base_addr;
   sptr = array->base_addr;
 
-  shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
-  if (shift < 0)
-    shift += len;
+  /* Avoid the costly modulo for trivially in-bound shifts.  */
+  if (shift < 0 || shift >= len)
+    {
+      shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
+      if (shift < 0)
+	shift += len;
+    }
 
   while (rptr)
     {

