This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Multiplication


On Fri, Sep 07, 2001 at 11:12:55AM +0200, Jan Hubicka wrote:
> > The multiplication cost table for the Pentium 4 may be ready by Monday.
> > I'm not sure about the search depth, 5 or 6; for a search depth of 6
> > a weekend may be too short.
>
> P4 will be interesting, due to extreme cost of multiply and shifts :)
>
I need a search depth of up to 28. Finding the best solution at that depth
would take my current program something around 10^41 years ...

Currently I'm only searching for two-register solutions. Is this enough for
IA32? For CPUs with many more registers (IA64, PPC, Alpha), would you also
be interested in three- or four-register solutions?

How does the multiplication code generator work? My stupid idea is:

  - Code_Proposal_Generator [size] (multiplier)
    Code_Proposal_Generator [time] (CPU_type, multiplier)

        emits some known useful proposals for  a *= b;

  - Multiple solutions are necessary because there are also other conditions
    affecting which of them is the best:
        src: mem/reg? 
        dst: mem/reg?
        src!=dst?
        # of free registers?

  - Multiple solutions within small code ranges will be weighted.

  - All solutions have their calculated costs:
      - latency
      - throughput (for larger peepholes throughput stalls will be
        calculated into additional latency)
      - registers used (especially on register-starved systems)
        (for larger peepholes register usage will be
        calculated into additional load/stores and related latency)
      - code size

  - For very large peepholes (;-) everything is transformed into
      - latency
      - code size
  
  - Depending on the -O setting the "best" solution is selected.
    On the Pentium 4 fast solutions are often huge, and the 1st-level code
    cache is very small, so size also affects speed. But this also depends
    on the usage pattern. This smells like code optimization needs a more
    sophisticated control than the switches -O, -O2, -Os and -O3.

Proposal:


Cost from the point of view of compile time:

-O0	no optimization; compile as fast as possible, using as little memory as possible
-O1	simple optimizations; compile time and memory usage may increase by 10...20% (typical)
-O2	optimizations; compile time and memory usage may increase by 50...100% (typical)
-O3	high optimizations; compile time and memory usage may increase by 200...1000% (typical)
-O4	highest optimizations; compile time and memory usage may increase by 1000...100000% (typical)

This can be done by switching optimization features on/off and by setting
parameters like search depth, tree depth and things like that
(-O4: unlimited).


-Os1...9
-Ot1...9
-Os0		(== -Ot0)
-Ot0		(== -Os0)
-Os		(== -Os5)

old -O  is something like -Ot2
old -O2 is something like -Ot5
old -O3 is something like -Ot8

This selects the size <=> speed trade-off. So -O4 -Os9 generates the smallest
code using all possible methods; it even tries the methods normally known to
increase code size (enabled by -O4).

-Os9 affects RTL optimization (never use a solution that is even one bit
longer), but also the code generator:

int two ( void )
{
    return 2;
}

-Ot0...9, -Os0...5:


	movl	$2, %eax
	ret

-Os6...9:

	xorl	%eax, %eax
	movb	$2, %al		# saves 1 byte
	ret

-- 
Frank Klemm

