This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[4.0 and mainline] Fix multiplication by constant expansion


Hi,
currently with -march=athlon we expand multiplication by 11 as

	leal	0(,%edx,4), %ecx
	movl	%edx, %eax
	sall	$4, %eax
	subl	%ecx, %eax
	subl	%edx, %eax

that is both longer and slower than imul.  Better equivalent would be:

	leal	(%edx,%edx,4), %eax
	leal	(%edx,%eax,2), %eax

that we sucesfully produce for -march=i686.  The problem is actually in
the generic expander that assume latency of rule:
   alg_add_t2_m         total := total * coeff + multiplicand
to be equivalent of add latency claiming that shift and add can execute
in parallel, that is wrong, since there is dependency chain.

The patch fixes that but leaves somewhat absurd case where we always
have same latency and cost.  This is correct because the algorithm as
done currently never introduce any parallelism.  It theoretically can
with:
   alg_add_t_m2         total := total + multiplicand * coeff;
but it never use it with coeff != 1.  I am going to try this adding
non-1 coefficients to see if it leads to anything good.  Athlon is
however example where shift-and-add instruction behave atomically with
latency of 2 so it won't introduce parallelism anyway.
Perhaps it would make sense to add specialized imul expander for i386 if
it can do signficantly better, I will try this out too.

The patch was bootstrapped/regtested i686-linux and it speeds up
sixtrack by over 10% with -march=athlon as well as few other bechmarks
on K8.  I also benchmarked P4 and saw no regressions there.

Thanks to H. J. Lu who originally pointed out the problem and Zdenek who
figured out the testcase!  OK for mainline and branch?

/* { dg-do compile { target i?86-*-* } } */
/* { dg-options "-O2 -march=athlon" } */
/* Multiplication by 11 should expand as sequence of two leas. */
/* { dg-final { scan-assembler "lea" } } */
/* { dg-final { scan-assembler-not "sub" } } */
/* { dg-final { scan-assembler-not "add" } } */
/* { dg-final { scan-assembler-not "neg" } } */
int a(int b)
{
	return b*11;
}

2006-01-01 Zdenek Dvorak <dvorakz@suse.cz>
            Jan Hubicka  <jh@suse.cz>
	    H.J. Lu  <hongjiu.lu@intel.com>
	* expmed.c (synth_mult): Fix computing of latency.
Index: expmed.c
===================================================================
*** expmed.c	(revision 108519)
--- expmed.c	(working copy)
*************** synth_mult (struct algorithm *alg_out, u
*** 2657,2676 ****
  	  /* If the target has a cheap shift-and-add instruction use
  	     that in preference to a shift insn followed by an add insn.
  	     Assume that the shift-and-add is "atomic" with a latency
! 	     equal to its cost, otherwise assume that on superscalar
! 	     hardware the shift may be executed concurrently with the
! 	     earlier steps in the algorithm.  */
  	  op_cost = add_cost[mode] + shift_cost[mode][m];
  	  if (shiftadd_cost[mode][m] < op_cost)
! 	    {
! 	      op_cost = shiftadd_cost[mode][m];
! 	      op_latency = op_cost;
! 	    }
  	  else
  	    op_latency = add_cost[mode];
  
  	  new_limit.cost = best_cost.cost - op_cost;
! 	  new_limit.latency = best_cost.latency - op_latency;
  	  synth_mult (alg_in, t / d, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
--- 2657,2671 ----
  	  /* If the target has a cheap shift-and-add instruction use
  	     that in preference to a shift insn followed by an add insn.
  	     Assume that the shift-and-add is "atomic" with a latency
! 	     equal to its cost */
  	  op_cost = add_cost[mode] + shift_cost[mode][m];
  	  if (shiftadd_cost[mode][m] < op_cost)
! 	    op_cost = shiftadd_cost[mode][m];
  	  else
  	    op_latency = add_cost[mode];
  
  	  new_limit.cost = best_cost.cost - op_cost;
! 	  new_limit.latency = best_cost.latency - op_cost;
  	  synth_mult (alg_in, t / d, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
*************** synth_mult (struct algorithm *alg_out, u
*** 2696,2715 ****
  	  /* If the target has a cheap shift-and-subtract insn use
  	     that in preference to a shift insn followed by a sub insn.
  	     Assume that the shift-and-sub is "atomic" with a latency
! 	     equal to it's cost, otherwise assume that on superscalar
! 	     hardware the shift may be executed concurrently with the
! 	     earlier steps in the algorithm.  */
  	  op_cost = add_cost[mode] + shift_cost[mode][m];
  	  if (shiftsub_cost[mode][m] < op_cost)
! 	    {
! 	      op_cost = shiftsub_cost[mode][m];
! 	      op_latency = op_cost;
! 	    }
  	  else
! 	    op_latency = add_cost[mode];
  
  	  new_limit.cost = best_cost.cost - op_cost;
! 	  new_limit.latency = best_cost.latency - op_latency;
  	  synth_mult (alg_in, t / d, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
--- 2691,2705 ----
  	  /* If the target has a cheap shift-and-subtract insn use
  	     that in preference to a shift insn followed by a sub insn.
  	     Assume that the shift-and-sub is "atomic" with a latency
! 	     equal to it's cost */
  	  op_cost = add_cost[mode] + shift_cost[mode][m];
  	  if (shiftsub_cost[mode][m] < op_cost)
! 	    op_cost = shiftsub_cost[mode][m];
  	  else
! 	    op_latency = op_cost;
  
  	  new_limit.cost = best_cost.cost - op_cost;
! 	  new_limit.latency = best_cost.latency - op_cost;
  	  synth_mult (alg_in, t / d, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
*************** synth_mult (struct algorithm *alg_out, u
*** 2742,2753 ****
  	{
  	  op_cost = shiftadd_cost[mode][m];
  	  new_limit.cost = best_cost.cost - op_cost;
  	  new_limit.latency = best_cost.latency - op_cost;
  	  synth_mult (alg_in, (t - 1) >> m, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
  	  alg_in->cost.latency += op_cost;
! 	  if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
  	    {
  	      struct algorithm *x;
  	      best_cost = alg_in->cost;
--- 2732,2759 ----
  	{
  	  op_cost = shiftadd_cost[mode][m];
  	  new_limit.cost = best_cost.cost - op_cost;
+ 	  /* Because we want to preffer combined lea over disjoint
+ 	     operations, we are interested in sollutions having equal
+ 	     cost to best one we found so far.  */
+ 	  if (best_cost.cost != cost_limit->cost && m)
+ 	    new_limit.cost++;
  	  new_limit.latency = best_cost.latency - op_cost;
+ 	  if (best_cost.latency != cost_limit->latency && m)
+ 	    new_limit.latency++;
  	  synth_mult (alg_in, (t - 1) >> m, &new_limit, mode);
  
  	  alg_in->cost.cost += op_cost;
  	  alg_in->cost.latency += op_cost;
! 	  if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost)
! 	      /* When the costs are equivalent, still preffer to use
! 	         shiftadd.  This should reduce total number of instructions
! 		 produced and register presure since intermediate result
! 		 of shift can't be reused later.
! 		 This help (for instance) some i386 CPUs where lea cost is
! 		 equivalent to cost of disjoint operations latency and cost
! 		 wise, while it still is shorter to encode and easier to
! 		 register allocate.  */
! 	      || (m && !CHEAPER_MULT_COST (&best_cost, &alg_in->cost)))
  	    {
  	      struct algorithm *x;
  	      best_cost = alg_in->cost;


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]