[PATCH] Peephole x86 multiplications by 3, 5 and 9

Mon May 3 14:02:00 GMT 2004

On Mon, 3 May 2004, Kazu Hirata wrote:
> > GCC decomposes integer multiplications with a constant operand into
> > a sequence of add and shift operations, if the constant is visible
> > during RTL expansion.  However, propagation of constants by the RTL
> > optimizers can result with explicit multiplication instructions with
> > immediate constant operands.
> >
> > For example, on i686-pc-linux-gnu, the following source code generates
> > an "imul $3, 8(%ebp), %eax" instruction.
> >
> > int foo(int x)
> > {
> >   int y = 3;
> >   return x*y;
> > }
>
> IMHO, I'd say we have a "bug" in the tree SSA optimizers if we end up
> discovering in a RTL optimizer that one of the operands of a
> multiplication is constant.  (Currently, the tree SSA optimizers do
> not do instruction combination among other things done at RTL level,
> so it is possible that we are missing opportunities to discover some
> constants.)

Currently in GCC constant discovery can occur extremely late.  For
example, PR optimization/12845 is caused by the register allocator(!)
spotting that a given pseudo is always constant and propagating a
zero into an instruction.  This constant wasn't spotted by CSE or
GCSE or any of the earlier passes, probably because it was exposed
by a later change.  Because reload doesn't attempt to simplify
the resulting instruction, and this happens too late for another pass
to simplify it afterwards, the generated "always true/false" RTL
reaches the back-end, which then gets confused.

So although tree-ssa should vastly diminish the problem, I expect
it won't make it go away.

Indeed, it is the arrival of tree-ssa that makes it reasonable to expect
the backends the perform single instruction replacement themselves.
Previously, the expectation was that the RTL optimizers would learn
to decompose multiplication and division by constants into multiple
instructions: http://gcc.gnu.org/ml/gcc-patches/2001-10/msg01434.html
Tree-ssa, by reducing te number of such constants, means that its
less likely the RTL optimizers will ever perform these optimizations.

> Currently, the combiner does not seem to convert "a * 2 + a" into
> "a * 3".

Clearly this is the correct thing to do in this case.  RTX_COST (a*3)
is significantly higher than RTX_COST(a*2+a).  The problem remains
that the backend has two instructions for multiplication by three, and
therefore should never emit the slow one.  Avoiding "imul $0", "imul $1"
and "imul $2" can reasonably be left to the RTL optimizers, but the
presence of efficient instruction replacements for "imul $3", "imul $5"
and "imul $9" makes this a target-specific issue.

An alternative strategy is for the x86 backend to tweaks its
mulsi3_1 pattern, to recognize the above special cases itself.  Which
would avoid the need for the peephole2s (much as a mov pattern can
special case a const_int zero operand and emit an xor instead).

Needless to say a large fraction of scientific and graphics software
operates on arrays of three-dimensional co-ordinates or uses 3x3
transformation matrices, where nearly ever array index contains a
multiplication by three.

Roger
--