[PATCH] match.pd: undistribute (a << s) & C, when C = (M << s) and exact_log2(M - 1)

Wed Nov 11 22:06:27 GMT 2020

On Wed, Nov 11, 2020 at 01:43:00PM -0800, Jim Wilson wrote:
> > On Wed, Nov 11, 2020 at 11:43:34AM +0100, Philipp Tomsich wrote:
> > > The patch addresses this by disallowing that rule, if an exact
> > power-of-2 is
> > > seen as C1.  The reason why I would prefer to have this canonicalised the
> > > same way the (X & C1) * C2 is canonicalised, is that cleaning this up
> > during
> > > combine is more difficult on some architectures that require multiple
> > insns
> > > to represent the shifted constant (i.e. C1 << C2).
> >
> > As I said, it is better to decide which one is better before or during
> > expansion based on target costs, sure, combine can't catch everything.
> >
> 
> it could be fixed in combine if we allowed 4 instructions to split into 3.
> We allow combinations of 4 insns, but we only allow splits into 2 insns as
> far as I know.

If combiner can do it, good, but I think it would be still helpful for many
targets to decide it during expansion, we have TER and so on
expansion of BIT_AND_EXPR with a constant second operand we can check
if the other one is a left shift and then based on the exact mask
decide if it is useful to try different possibilities and ask for their
costs (try (x & c1) << c2, (x << c3) & c4 and perhaps (x << c5) >> c6).
We already have other cases where during expansion we try multiple things
and compare their costs.  E.g. for division and modulo, if we know that the
most significant bit is clear on both of the operands, we can expand as both
signed or unsigned division/modulo and so we try both and pick the one
which is less costly on the target.  Similarly, I think we could do it
for right shifts, if we know the most significant bit is clear, we can
expand it as both arithmetic and logical right shift, so we can ask the
target what it prefers cost-wise.

E.g. on x86_64
unsigned long long
foo (unsigned long long x)
{
  return (x << 47) >> 17;
}

unsigned long long
bar (unsigned long long x)
{
  return (x & 0x1ffff) << 30;
}

unsigned long long
baz (unsigned long long x)
{
  unsigned long long y;
  __asm ("" : "=r" (y) : "0" (x & 0x1ffff));
  return y << 30;
}

unsigned long long
qux (unsigned long long x)
{
  return (x << 30) & (0x1ffffULL << 30);
}
foo is 4 instructions 12 bytes, baz is 4 instructions 13 bytes,
bar/qux are 5 instructions 21 bytes, which of foo or baz is faster would
need to be benchmarked and no idea what the rtx costs would say, but bar/qux
certainly looks more costly and I'm quite sure the rtx costs would say that
too.

The reason for the match.pd canonicalization is I bet it wants to put
possible multiple shifts adjacent and similarly with the masks, so that when
one uses
(((x & c1) << c2) & c3) << c4 etc. one can simplify that into just one shift
and one masking; and having the canonicalization do it one way for some
constants/shift pairs and another way for others wouldn't achieve that.

	Jakub