[Bug target/34452] New: Multiply-by-constant pessimation

Thu Dec 13 09:59:00 GMT 2007

The multiply-by-constant optimization work poorly for x86 targets
with multiply units with short latency.

Try some small values M for a sample program like,

  long f (long x) { return x * M; }

for e.g. the -mtune=k8 subtarget.  One gets slow sequences of
lea, add, sub, sal.

So what's wrong?  Is it expmed.c's synth_mult that doesn't measure
costs correctly?  Or does i386.c provide poor cost measures?

I'd say that synth_mult's cost model is great for 3 operand
machines, but not very good for x86.  It does not see the
additional mov instructions inserted in many of its most clever
sequences, nor does it understand that small shifts are done with
the 2 cycle lea instruction (when src!=dst).

Additionally, synth_mult thinks a 10 operation long sequence that
cost 999 is the way to go if a single mult costs 1000.  It does
not take sequence length into account at all.  Perhaps it should?

Let's look at some examples of generated code for -mtune=k8.

6:  (11 bytes, >= 3 cycles)
        leaq    (%rdi,%rdi), %rax
        salq    $3, %rdi
        subq    %rax, %rdi

10: (12 bytes, >= 4 cycles)
        leaq    0(,%rdi,8), %rax
        leaq    (%rax,%rdi,2), %rax

11: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        subq    %rdi, %rax

13: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        addq    %rdi, %rax

etc, etc.

The cycle counts are only if we get ideal parallel execution,
otherwise one additional cycle will be needed.  The imul
instruction needs 4 cycles (64-bit operation) and is not
alignment sensitive and needs just one execution slot.  It can
therefore execute simultaneously with independent instructions,
while the above sequences will use up much decode, execute,
and retire resources.

What can be done about this?

The simple fix is to pretend multiplication is cheaper:

--- /u/gcc/gcc-4.2.2/gcc/config/i386/.~/i386.c.~1~      Sat Sep  1 17:28:30
2007
+++ /u/gcc/gcc-4.2.2/gcc/config/i386/i386.c     Thu Dec 13 10:12:07 2007
@@ -17254,7 +17254,7 @@
                op0 = XEXP (op0, 0), mode = GET_MODE (op0);
            }

-         *total = (ix86_cost->mult_init[MODE_INDEX (mode)]
+         *total = (ix86_cost->mult_init[MODE_INDEX (mode)] - 1
                    + nbits * ix86_cost->mult_bit
                    + rtx_cost (op0, outer_code) + rtx_cost (op1, outer_code));

This avoids most of the problems, but we still get a 4 cycle,
2 lea sequence for M = 10.

Potential problem: This might affect other parts of the optimizer
than synth_mult.  That might be bad, but it might also be desirable.

Another fix would perhaps be to teach synth_mult to understand that
it's generating code for a 2.5 operand machine (one that can only
do "a x= b", not "a = b x c", for some operation x).  We should
teach it that there will be moves inserted for sequences that rely
on a source register twice (more or less).

Letting synth_mult take sequence length into account would also
make sense, I think.  A cost of 1 per operation does not seem
unreasonable.

(I wrote synth_mult originally.)


-- 
           Summary: Multiply-by-constant pessimation
           Product: gcc
           Version: 4.2.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452