This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [4.0 and mainline] Fix multiplication by constant expansion
> >
> > Hi Jan,
> >
> > I'm fairly certain that disabling the splitting of "cost" ties based
> > on latency in synth_mult to work around an rtx_costs problem with
> > -march=athlon isn't the correct solution.
>
> It is not. The code produced by the rule is equivalent of
>
> tmp = tmp << 5
> tmp += multiplicant
>
> This sequence has latency of shift and add operation, not the shift
> operation as expected. Reducing cost of lea to 1 fixes the problem as
> it makes the variant as expensive as the lea variant that has priority,
> but it is just an workaround.
> >
> > I'm currently investigating a better solution, but it looks like
> > the underlying cause is the high cost of "lea" in i386.c:athlon_cost.
> > Indeed, as currently parameterized lea isn't cheaper than a shift
> > followed by an addition or subtraction, so synth_mult assumes that
> > its always better to generate the equivalent atomic steps that could
> > potentially be scheduled and issued independently. And at the point
> > synth_mult selects the "16x - 4x - x" sequence the two shifts and two
> > subtractions are also cheaper then an athlon imul, which is currently
> > parameterized to have a cost of five. Why a shift (which has a cost
> > on one) is transformed into an lea (which has a cost of two) is a
> > mystery...
>
> Because i386 is 2 address and the operand need different destination, so
> lea is cheaper than shift + mov operation that has same latency and
> produces longer code.
>
> However look at sequence produced:
>
>
> (insn 12 11 13 1 (parallel [
> (set (reg:SI 62)
> (ashift:SI (reg:SI 61)
> (const_int 2 [0x2])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
> (const_int 4 [0x4]))
> (nil)))
>
> (insn 13 12 14 1 (parallel [
> (set (reg:SI 63)
> (ashift:SI (reg:SI 62)
> (const_int 2 [0x2])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (nil))
>
> (insn 14 13 15 1 (parallel [
> (set (reg:SI 64)
> (minus:SI (reg:SI 63)
> (reg:SI 62)))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
> (const_int 12 [0xc]))
> (nil)))
>
> (insn 15 14 16 1 (parallel [
> (set (reg:SI 65)
> (minus:SI (reg:SI 64)
> (reg/v:SI 59 [ b ])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
> (const_int 11 [0xb]))
> (nil)))
>
> There is no parallelism in the code at all. With the patch the sequence
> is:
>
> (insn 12 11 13 1 (parallel [
> (set (reg:SI 62)
> (ashift:SI (reg:SI 61)
> (const_int 2 [0x2])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (nil))
>
> (insn 13 12 14 1 (parallel [
> (set (reg:SI 63)
> (plus:SI (reg:SI 62)
> (reg/v:SI 59 [ b ])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
> (const_int 5 [0x5]))
> (nil)))
>
> (insn 14 13 15 1 (parallel [
> (set (reg:SI 64)
> (ashift:SI (reg:SI 63)
> (const_int 1 [0x1])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (nil))
>
> (insn 15 14 16 1 (parallel [
> (set (reg:SI 65)
> (plus:SI (reg:SI 64)
> (reg/v:SI 59 [ b ])))
> (clobber (reg:CC 17 flags))
> ]) -1 (nil)
> (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
> (const_int 11 [0xb]))
> (nil)))
>
> That has shorter latency in the cost model.
Well, actually not, it is also 4 instructions, just more "canonical".
Hmm...
I guess I have to dig into getting the parallelism exposed directly but
even if I got it right, we would end up with the 2 address issues we see
on the testcase. Let me check.
Honza