[patch] expmed.c: Speed up synth_mult further.
Kazu Hirata
kazu@codesourcery.com
Thu Sep 22 01:13:00 GMT 2005
Hi,
Attached is a patch to further speed up synth_mult.
First, note that we have four major ways to implement multiplication.
1. alg_shift
Left shifts. Only for even multipliers for obvious reasons.
2. alg_add_t_m2 and alg_sub_t_m2
Only for odd multipliers t. Transforms a * t into
a * (t - 1) + a or a * (t + 1) - a.
3. alg_add_factor and alg_sub_factor
For both odd and even multipliers t. Transforms a * t into
a * (2^m + 1) * t' or a * (2^m - 1) * t'
for some m >= 2.
4. alg_add_t2_m and alg_sub_t2_m
Only for odd multipliers t. Transforms a * t into
((a * (t - 1)/2^m) << m) + a or ((a * (t + 1)/2^m) << m) - a,
where 2^m is the largest power of two dividing t - 1 or t + 1,
respectively.
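As a concrete illustration, here is each of the four decompositions written out in C for small multipliers. The helper names are mine, purely illustrative; they do not appear in expmed.c.

```c
/* 1. alg_shift: a * 40 = (a * 5) << 3.  */
unsigned mul_by_40_shift (unsigned a) { return (a * 5) << 3; }

/* 2. alg_add_t_m2: for odd t = 45, a * 45 = a * 44 + a.  */
unsigned mul_by_45_t_m2 (unsigned a) { return a * 44 + a; }

/* 3. alg_add_factor: 45 = (2^2 + 1) * 9, so
   a * 45 = ((a * 9) << 2) + a * 9.  */
unsigned mul_by_45_factor (unsigned a)
{
  unsigned p = a * 9;
  return (p << 2) + p;
}

/* 4. alg_add_t2_m: 45 - 1 = 44 = 11 * 2^2, so
   a * 45 = ((a * 11) << 2) + a.  */
unsigned mul_by_45_t2_m (unsigned a) { return ((a * 11) << 2) + a; }
```

Each helper reduces the multiplication by 40 or 45 to a multiplication by a smaller constant (5, 44, 9 or 11), which synth_mult would then decompose recursively in the same way.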
Notice that if a given multiplier is even, we have two ways to break
down the multiplication: alg_shift or alg_{add,sub}_factor. Suppose we
pick alg_{add,sub}_factor. Since 2^m +/- 1 is odd for any m >= 2, the
remaining multiplier t / (2^m +/- 1) is still even, so we face the
same choice again. However many times we choose alg_{add,sub}_factor,
we must eventually take alg_shift: sooner or later we run out of
factors of the form 2^m +/- 1, while the multiplier stays even.
For example, given the multiplier 40, we could emit either
(a << 3) * 5 or (a * 5) << 3. These choices for even multipliers
increase the branching factor of the algorithm.
The patch solves this problem by using alg_shift exclusively for even
multipliers.
Here are compile times in seconds, using a cross compiler from
x86_64-pc-linux-gnu to alpha-linux-gnu on the testcase in PR 23971.

  original  patched  diff (%)
     0.685    0.175     -74.5  (wow!)
The diff is bigger than the change it really makes; most of the churn
is reindentation from the new else block.
The original code looks like
if (t is even)
{
try alg_shift.
}
if (t is odd)
{
try alg_add_t_m2 and alg_sub_t_m2.
}
try alg_add_factor and alg_sub_factor.
if (t is odd)
{
try alg_add_t2_m and alg_sub_t2_m.
}
With my patch, the code will look like
if (t is even)
{
try alg_shift.
}
else
{
try alg_add_t_m2 and alg_sub_t_m2.
try alg_add_factor and alg_sub_factor.
try alg_add_t2_m and alg_sub_t2_m.
}
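The effect of the restructuring on the branching factor can be sketched
with a toy node counter. This is a simplified model of my own (only
alg_shift, alg_{add,sub}_t_m2 and alg_{add,sub}_factor, no cost limit
and no caching), not the real search in expmed.c:

```c
/* Count the nodes visited while recursively decomposing multiplier t.
   When RESTRICTED is nonzero, an even t tries alg_shift exclusively,
   as in the patch; otherwise it tries both alg_shift and
   alg_{add,sub}_factor, as before.  */
unsigned long
count_nodes (unsigned long t, int restricted)
{
  unsigned long nodes = 1;
  int m;

  if (t <= 1)
    return nodes;

  if ((t & 1) == 0)
    {
      /* alg_shift.  One bit per step is enough for counting purposes;
         the real code strips all trailing zeros at once.  */
      nodes += count_nodes (t >> 1, restricted);
      if (restricted)
        return nodes;
    }
  else
    {
      /* alg_add_t_m2 and alg_sub_t_m2.  */
      nodes += count_nodes (t - 1, restricted);
      nodes += count_nodes (t + 1, restricted);
    }

  /* alg_add_factor and alg_sub_factor: t = (2^m +/- 1) * t'.  */
  for (m = 2; ((unsigned long) 1 << m) <= t; m++)
    {
      unsigned long d = ((unsigned long) 1 << m) + 1;
      if (t % d == 0 && t > d)
        nodes += count_nodes (t / d, restricted);
      d = ((unsigned long) 1 << m) - 1;
      if (t % d == 0 && t > d)
        nodes += count_nodes (t / d, restricted);
    }
  return nodes;
}
```

On this model the restricted search visits strictly fewer nodes for any
even multiplier with a factor of the form 2^m +/- 1, such as
40 = 8 * 5, because the alg_{add,sub}_factor subtrees for even t are
never explored.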
Tested on x86_64-pc-linux-gnu. OK to apply?
Kazu Hirata
2005-09-21 Kazu Hirata <kazu@codesourcery.com>
* expmed.c (synth_mult): Speed up by using alg_shift
exclusively for even multipliers.
Index: expmed.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/expmed.c,v
retrieving revision 1.235
diff -c -d -p -r1.235 expmed.c
*** expmed.c 21 Sep 2005 16:32:10 -0000 1.235
--- expmed.c 21 Sep 2005 17:05:04 -0000
*************** synth_mult (struct algorithm *alg_out, u
*** 2566,2577 ****
if (cache_hit)
goto done;
}
!
! /* If we have an odd number, add or subtract one. */
! if ((t & 1) != 0)
{
unsigned HOST_WIDE_INT w;
do_alg_addsub_t_m2:
for (w = 1; (w & t) != 0; w <<= 1)
;
--- 2566,2576 ----
if (cache_hit)
goto done;
}
! else
{
unsigned HOST_WIDE_INT w;
+ /* If we have an odd number, add or subtract one. */
do_alg_addsub_t_m2:
for (w = 1; (w & t) != 0; w <<= 1)
;
*************** synth_mult (struct algorithm *alg_out, u
*** 2626,2732 ****
}
if (cache_hit)
goto done;
- }
-
- /* Look for factors of t of the form
- t = q(2**m +- 1), 2 <= m <= floor(log2(t - 1)).
- If we find such a factor, we can multiply by t using an algorithm that
- multiplies by q, shift the result by m and add/subtract it to itself.
! We search for large factors first and loop down, even if large factors
! are less probable than small; if we find a large factor we will find a
! good sequence quickly, and therefore be able to prune (by decreasing
! COST_LIMIT) the search. */
! do_alg_addsub_factor:
! for (m = floor_log2 (t - 1); m >= 2; m--)
! {
! unsigned HOST_WIDE_INT d;
! d = ((unsigned HOST_WIDE_INT) 1 << m) + 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_add_factor))
{
! /* If the target has a cheap shift-and-add instruction use
! that in preference to a shift insn followed by an add insn.
! Assume that the shift-and-add is "atomic" with a latency
! equal to its cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftadd_cost[mode][m] < op_cost)
{
! op_cost = shiftadd_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_add_factor;
}
- /* Other factors will have been taken care of in the recursion. */
- break;
- }
! d = ((unsigned HOST_WIDE_INT) 1 << m) - 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_sub_factor))
! {
! /* If the target has a cheap shift-and-subtract insn use
! that in preference to a shift insn followed by a sub insn.
! Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftsub_cost[mode][m] < op_cost)
{
! op_cost = shiftsub_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_sub_factor;
}
- break;
}
! }
! if (cache_hit)
! goto done;
! /* Try shift-and-add (load effective address) instructions,
! i.e. do a*3, a*5, a*9. */
! if ((t & 1) != 0)
! {
do_alg_add_t2_m:
q = t - 1;
q = q & -q;
--- 2625,2728 ----
}
if (cache_hit)
goto done;
! /* Look for factors of t of the form
! t = q(2**m +- 1), 2 <= m <= floor(log2(t - 1)).
! If we find such a factor, we can multiply by t using an algorithm that
! multiplies by q, shift the result by m and add/subtract it to itself.
! We search for large factors first and loop down, even if large factors
! are less probable than small; if we find a large factor we will find a
! good sequence quickly, and therefore be able to prune (by decreasing
! COST_LIMIT) the search. */
! do_alg_addsub_factor:
! for (m = floor_log2 (t - 1); m >= 2; m--)
{
! unsigned HOST_WIDE_INT d;
!
! d = ((unsigned HOST_WIDE_INT) 1 << m) + 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_add_factor))
{
! /* If the target has a cheap shift-and-add instruction use
! that in preference to a shift insn followed by an add insn.
! Assume that the shift-and-add is "atomic" with a latency
! equal to its cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftadd_cost[mode][m] < op_cost)
! {
! op_cost = shiftadd_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_add_factor;
! }
! /* Other factors will have been taken care of in the recursion. */
! break;
}
! d = ((unsigned HOST_WIDE_INT) 1 << m) - 1;
! if (t % d == 0 && t > d && m < maxm
! && (!cache_hit || cache_alg == alg_sub_factor))
{
! /* If the target has a cheap shift-and-subtract insn use
! that in preference to a shift insn followed by a sub insn.
! Assume that the shift-and-sub is "atomic" with a latency
! equal to it's cost, otherwise assume that on superscalar
! hardware the shift may be executed concurrently with the
! earlier steps in the algorithm. */
! op_cost = add_cost[mode] + shift_cost[mode][m];
! if (shiftsub_cost[mode][m] < op_cost)
! {
! op_cost = shiftsub_cost[mode][m];
! op_latency = op_cost;
! }
! else
! op_latency = add_cost[mode];
! new_limit.cost = best_cost.cost - op_cost;
! new_limit.latency = best_cost.latency - op_latency;
! synth_mult (alg_in, t / d, &new_limit, mode);
! alg_in->cost.cost += op_cost;
! alg_in->cost.latency += op_latency;
! if (alg_in->cost.latency < op_cost)
! alg_in->cost.latency = op_cost;
! if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
! {
! struct algorithm *x;
! best_cost = alg_in->cost;
! x = alg_in, alg_in = best_alg, best_alg = x;
! best_alg->log[best_alg->ops] = m;
! best_alg->op[best_alg->ops] = alg_sub_factor;
! }
! break;
}
}
! if (cache_hit)
! goto done;
! /* Try shift-and-add (load effective address) instructions,
! i.e. do a*3, a*5, a*9. */
do_alg_add_t2_m:
q = t - 1;
q = q & -q;