Bug 15239 - suboptimal mult-by-const expansion cost limit
Summary: suboptimal mult-by-const expansion cost limit
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 3.4.0
Importance: P2 enhancement
Target Milestone: 4.0.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2004-04-30 23:46 UTC by Luchezar Belev
Modified: 2004-06-24 21:07 UTC (History)
1 user

See Also:
Host: *
Target: *
Build: *
Known to work:
Known to fail:
Last reconfirmed: 2004-05-01 00:11:43


Description Luchezar Belev 2004-04-30 23:46:04 UTC
Look in the file gcc/expmed.c, function expand_mult(). The code 
      mult_cost = rtx_cost (gen_rtx_MULT (mode, op0, op1), SET); 
      mult_cost = MIN (12 * add_cost, mult_cost); 
calculates the initial cost limit before the recursive search for 
a good expansion is invoked. 
 
The problem is that capping the cost at that of 12 adds (the 
second line) is arbitrary: on architectures (such as the Pentium 4) where 
multiplication is far more expensive than 12 adds, the generated code is 
likely to be suboptimal in terms of execution speed. 
It would be better if this limit were platform- and/or optimization-level 
dependent (i.e. do not artificially limit the expansion at -O3 or higher). 
Intel's compiler (icc) does not limit the expansion in this way, even 
though for some constants it generates rather long sequences of 
adds/subs/leas/shls (instead of an imul).
Comment 1 Joseph S. Myers 2004-05-01 00:07:40 UTC
Subject: Re:  New: suboptimal mult-by-const expansion cost limit

On Fri, 30 Apr 2004, l_belev at yahoo dot com wrote:

> The problem is that the limiting of the cost with the cost of 12 adds (the 
> second line) is arbitrary and for some architectures (like pentium4) where 

And therefore at least it should be made runtime controllable with
--param, even if the default stays fixed at 12, so that it is easier to
test the effects of varying it (on code size and performance benchmarks).

Comment 2 Andrew Pinski 2004-05-01 00:11:43 UTC
This should be controlled by the target instead of the normal limit, because on something like powerpc 
a multiply by an integer constant is only about 6x slower than an add (it depends on the processor, 
really), so adding too many adds will just slow it down.
Comment 3 Jim Wilson 2004-05-01 04:12:53 UTC
Subject: Re:  suboptimal mult-by-const expansion cost limit

pinskia at gcc dot gnu dot org wrote:
> This should be controlled by the target instead of the normal limit, because on something like powerpc 
> a multiply by an integer constant is only about 6x slower than an add (it depends on the processor, 
> really), so adding too many adds will just slow it down.

add_cost and mult_cost are both target values, so this is already 
controlled by the target in some sense.

The point of this check is to limit the number of adds we emit to at 
most 12, to avoid unreasonable code size expansion.  So if a multiply 
costs only 6x an add, then we will never hit this limit, and you 
don't have to worry about it.
Comment 4 Jim Wilson 2004-05-01 04:16:16 UTC
Subject: Re:  New: suboptimal mult-by-const expansion cost limit

l_belev at yahoo dot com wrote:
> Look in the file gcc/expmed.c, function expand_mult(). The code 
>       mult_cost = rtx_cost (gen_rtx_MULT (mode, op0, op1), SET); 
>       mult_cost = MIN (12 * add_cost, mult_cost); 
> calculates the initial cost limit before the recursive search for 
> a good expansion is invoked. 

As I mentioned on the gcc list, this is apparently used to avoid 
unreasonable code size expansion.  When converting multiplies to 
add-sequences, we limit the number of adds emitted to 12, so that the 
code size increase will be bounded.

I suggested changing the calculation to 6 * add_cost + 6 * 
shift_cost, because shifts often have higher costs than adds, and we are 
really emitting a sequence of shifts and adds, not just a sequence of adds.

It may also be reasonable to vary the number based on the optimization 
level, e.g. use a higher limit for -O3.
Comment 5 Luchezar Belev 2004-05-01 13:06:23 UTC
(In reply to comment #4)  
  
> As I mentioned on the gcc list, this is apparently used to avoid   
> unreasonable code size expansion.  When converting multiplies to   
> add-sequences, we limit the number of adds emitted to 12, so that the   
> code size increase will be bounded.  
 
To avoid unreasonable code size expansion, it would probably be 
most appropriate to limit the total number of bytes the expansion is allowed 
to span. Of course, that is too platform-dependent and may be problematic. 
Another option is to limit the number of instructions, which, although 
not exactly what we need, is still closer to it than limiting the 
number of cycles. IMO the number of cycles varies much more from one 
instruction to another than the number of code bytes, whatever the 
architecture. Generally the longer instructions are those with immediate 
operands, and these expansions probably don't use many of them (or the 
immediates are 8-bit). Even better, AFAIK most RISC architectures have fixed 
instruction sizes, so for them this approach would give exactly what we need. 
 
Comment 6 Luchezar Belev 2004-05-01 14:26:19 UTC
(In reply to comment #5) 
> IMO the number of cycles varies much more from one  
> instruction to another than the number of code bytes, whatever the  
> architecture. 
 
I'll try to express this more clearly. 
Put differently: by limiting the number of cycles in order to limit code size, 
we effectively go through two approximations: first we approximate the number 
of instructions by the number of cycles, and then we approximate the 
number of bytes by the number of instructions: 
 num_cycles -> num_insns -> num_bytes. This way the end result could 
be too far from what we wanted. While directly limiting num_bytes 
is hard to do in the middle-end (it falls within the back-end's competence), we could 
at least limit num_insns, and so avoid one of the two approximations. 
 
Comment 7 GCC Commits 2004-06-24 20:39:05 UTC
Subject: Bug 15239

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	sayle@gcc.gnu.org	2004-06-24 20:39:00

Modified files:
	gcc            : ChangeLog expmed.c 

Log message:
	PR middle-end/15239
	* expmed.c (expand_mult): Remove artificial restriction on the
	maximum cost of a synthetic multiplication sequence.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.4122&r2=2.4123
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/expmed.c.diff?cvsroot=gcc&r1=1.168&r2=1.169

Comment 8 Andrew Pinski 2004-06-24 21:07:35 UTC
Fixed for 3.5.0.