This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [4.0 and mainline] Fix multiplication by constant expansion


> 
> Hi Jan,
> 
> I'm fairly certain that disabling the splitting of "cost" ties based
> on latency in synth_mult to work around an rtx_costs problem with
> -march=athlon isn't the correct solution.

It is not.  The code produced by the rule is equivalent of

tmp = tmp << 5
tmp += multiplicant

This sequence has latency of shift and add operation, not the shift
operation as expected.  Reducing cost of lea to 1 fixes the problem as
it makes the variant as expensive as the lea variant that has priority,
but it is just an workaround.
> 
> I'm currently investigating a better solution, but it looks like
> the underlying cause is the high cost of "lea" in i386.c:athlon_cost.
> Indeed, as currently parameterized lea isn't cheaper than a shift
> followed by an addition or subtraction, so synth_mult assumes that
> its always better to generate the equivalent atomic steps that could
> potentially be scheduled and issued independently.  And at the point
> synth_mult selects the "16x - 4x - x" sequence the two shifts and two
> subtractions are also cheaper then an athlon imul, which is currently
> parameterized to have a cost of five.  Why a shift (which has a cost
> on one) is transformed into an lea (which has a cost of two) is a
> mystery...

Because i386 is 2 address and the operand need different destination, so
lea is cheaper than shift + mov operation that has same latency and
produces longer code.

However look at sequence produced:


(insn 12 11 13 1 (parallel [
            (set (reg:SI 62)
                (ashift:SI (reg:SI 61)
                    (const_int 2 [0x2])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
            (const_int 4 [0x4]))
        (nil)))

(insn 13 12 14 1 (parallel [
            (set (reg:SI 63)
                (ashift:SI (reg:SI 62)
                    (const_int 2 [0x2])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 14 13 15 1 (parallel [
            (set (reg:SI 64)
                (minus:SI (reg:SI 63)
                    (reg:SI 62)))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
            (const_int 12 [0xc]))
        (nil)))

(insn 15 14 16 1 (parallel [
            (set (reg:SI 65)
                (minus:SI (reg:SI 64)
                    (reg/v:SI 59 [ b ])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
            (const_int 11 [0xb]))
        (nil)))

There is no parallelism in the code at all.  With the patch the sequence
is:

(insn 12 11 13 1 (parallel [
            (set (reg:SI 62)
                (ashift:SI (reg:SI 61)
                    (const_int 2 [0x2])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 13 12 14 1 (parallel [
            (set (reg:SI 63)
                (plus:SI (reg:SI 62)
                    (reg/v:SI 59 [ b ])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
            (const_int 5 [0x5]))
        (nil)))

(insn 14 13 15 1 (parallel [
            (set (reg:SI 64)
                (ashift:SI (reg:SI 63)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 15 14 16 1 (parallel [
            (set (reg:SI 65)
                (plus:SI (reg:SI 64)
                    (reg/v:SI 59 [ b ])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (expr_list:REG_EQUAL (mult:SI (reg/v:SI 59 [ b ])
            (const_int 11 [0xb]))
        (nil)))

That has shorter latency in the cost model.

You are right, that CSE later manages to introduce some parallelism to
the first sequence, but it is not what the cost computed by mult
expander represent.

The sequence is
t = b * 4; t2 = t * 4; t3 = t - t2; t4 = t4 - b
CSE transforms it to:
t = b * 4; t2 = b * 16; t3 = t - t2; t4 = t4 - b

introducing the parallelism.  There is way mult code can synthetize the
second sollution knowingly if the alg_add_t_m2 / alg_sub_t_m2 was used
with nontrivial multiplicand.  I am going to try patch for that now.

Honza
> 
> Roger
> --
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]