This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [4.0 and mainline] Fix multiplication by constant expansion


> > 
> > Hi Jan,
> > 
> > I'm fairly certain that disabling the splitting of "cost" ties based
> > on latency in synth_mult to work around an rtx_costs problem with
> > -march=athlon isn't the correct solution.
> 
> It is not.  The code produced by the rule is equivalent of
> 
> tmp = tmp << 5
> tmp += multiplicant
> 
> This sequence has latency of shift and add operation, not the shift
> operation as expected.  Reducing cost of lea to 1 fixes the problem as
> it makes the variant as expensive as the lea variant that has priority,
> but it is just an workaround.
> > 
> > I'm currently investigating a better solution, but it looks like
> > the underlying cause is the high cost of "lea" in i386.c:athlon_cost.
> > Indeed, as currently parameterized lea isn't cheaper than a shift
> > followed by an addition or subtraction, so synth_mult assumes that
> > its always better to generate the equivalent atomic steps that could
> > potentially be scheduled and issued independently.  And at the point
> > synth_mult selects the "16x - 4x - x" sequence the two shifts and two
> > subtractions are also cheaper then an athlon imul, which is currently
> > parameterized to have a cost of five.  Why a shift (which has a cost
> > on one) is transformed into an lea (which has a cost of two) is a
> > mystery...
> 
> Because i386 is 2 address and the operand need different destination, so
> lea is cheaper than shift + mov operation that has same latency and
> produces longer code.
> 
> However look at sequence produced:
> 
> 
> (insn 12 11 13 1 (parallel [
>             (set (reg:SI 62)
>                 (ashift:SI (reg:SI 61)
>                     (const_int 2 [0x2])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
>             (const_int 4 [0x4]))
>         (nil)))
> 
> (insn 13 12 14 1 (parallel [
>             (set (reg:SI 63)
>                 (ashift:SI (reg:SI 62)
>                     (const_int 2 [0x2])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (nil))
> 
> (insn 14 13 15 1 (parallel [
>             (set (reg:SI 64)
>                 (minus:SI (reg:SI 63)
>                     (reg:SI 62)))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
>             (const_int 12 [0xc]))
>         (nil)))
> 
> (insn 15 14 16 1 (parallel [
>             (set (reg:SI 65)
>                 (minus:SI (reg:SI 64)
>                     (reg/v:SI 59 [ b ])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
>             (const_int 11 [0xb]))
>         (nil)))
> 
> There is no parallelism in the code at all.  With the patch the sequence
> is:
> 
> (insn 12 11 13 1 (parallel [
>             (set (reg:SI 62)
>                 (ashift:SI (reg:SI 61)
>                     (const_int 2 [0x2])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (nil))
> 
> (insn 13 12 14 1 (parallel [
>             (set (reg:SI 63)
>                 (plus:SI (reg:SI 62)
>                     (reg/v:SI 59 [ b ])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
>             (const_int 5 [0x5]))
>         (nil)))
> 
> (insn 14 13 15 1 (parallel [
>             (set (reg:SI 64)
>                 (ashift:SI (reg:SI 63)
>                     (const_int 1 [0x1])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (nil))
> 
> (insn 15 14 16 1 (parallel [
>             (set (reg:SI 65)
>                 (plus:SI (reg:SI 64)
>                     (reg/v:SI 59 [ b ])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil)
>     (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
>             (const_int 11 [0xb]))
>         (nil)))
> 
> That has shorter latency in the cost model.

Well, actually not, it is also 4 instructions, just more "canonical".
Hmm...
I guess I have to dig into getting the parallelism exposed directly but
even if I got it right, we would end up with the 2 address issues we see
on the testcase.  Let me check.

Honza


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]