[Bug target/34452] New: Multiply-by-constant pessimation
tege-gcc at swox dot com
gcc-bugzilla@gcc.gnu.org
Thu Dec 13 09:59:00 GMT 2007
The multiply-by-constant optimization work poorly for x86 targets
with multiply units with short latency.
Try some small values M for a sample program like,
long f (long x) { return x * M; }
for e.g. the -mtune=k8 subtarget. One gets slow sequences of
lea, add, sub, sal.
So what's wrong? Is it expmed.c's synth_mult that doesn't measure
costs correctly? Or does i386.c provide poor cost measures?
I'd say that synth_mult's cost model is great for 3 operand
machines, but not very good for x86. It does not see the
additional mov instructions inserted in many of its most clever
sequences, nor does it understand that small shifts are done with
the 2 cycle lea instruction (when src!=dst).
Additionally, synth_mult thinks a 10 operation long sequence that
cost 999 is the way to go if a single mult costs 1000. It does
not take sequence length into account at all. Perhaps it should?
Let's look at some examples of generated code for -mtune=k8.
6: (11 bytes, >= 3 cycles)
leaq (%rdi,%rdi), %rax
salq $3, %rdi
subq %rax, %rdi
10: (12 bytes, >= 4 cycles)
leaq 0(,%rdi,8), %rax
leaq (%rax,%rdi,2), %rax
11: (21 bytes, >= 4 cycles)
leaq 0(,%rdi,4), %rdx
movq %rdi, %rax
salq $4, %rax
subq %rdx, %rax
subq %rdi, %rax
13: (21 bytes, >= 4 cycles)
leaq 0(,%rdi,4), %rdx
movq %rdi, %rax
salq $4, %rax
subq %rdx, %rax
addq %rdi, %rax
etc, etc.
The cycle counts are only if we get ideal parallel execution,
otherwise one additional cycle will be needed. The imul
instruction needs 4 cycles (64-bit operation) and is not
alignment sensitive and needs just one execution slot. It can
therefore execute simultaneously with independent instructions,
while the above sequences will use up much decode, execute,
and retire resources.
What can be done about this?
The simple fix is to pretend multiplication is cheaper:
--- /u/gcc/gcc-4.2.2/gcc/config/i386/.~/i386.c.~1~ Sat Sep 1 17:28:30
2007
+++ /u/gcc/gcc-4.2.2/gcc/config/i386/i386.c Thu Dec 13 10:12:07 2007
@@ -17254,7 +17254,7 @@
op0 = XEXP (op0, 0), mode = GET_MODE (op0);
}
- *total = (ix86_cost->mult_init[MODE_INDEX (mode)]
+ *total = (ix86_cost->mult_init[MODE_INDEX (mode)] - 1
+ nbits * ix86_cost->mult_bit
+ rtx_cost (op0, outer_code) + rtx_cost (op1, outer_code));
This avoids most of the problems, but we still get a 4 cycle,
2 lea sequence for M = 10.
Potential problem: This might affect other parts of the optimizer
than synth_mult. That might be bad, but it might also be desirable.
Another fix would perhaps be to teach synth_mult to understand that
it's generating code for a 2.5 operand machine (one that can only
do "a x= b", not "a = b x c", for some operation x). We should
teach it that there will be moves inserted for sequences that rely
on a source register twice (more or less).
Letting synth_mult take sequence length into account would also
make sense, I think. A cost of 1 per operation does not seem
unreasonable.
(I wrote synth_mult originally.)
--
Summary: Multiply-by-constant pessimation
Product: gcc
Version: 4.2.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452
More information about the Gcc-bugs
mailing list