[PATCH] tree-optimization/94865 - combine BIT_INSERT_EXPR of BIT_FIELD_REF

Richard Biener rguenther@suse.de
Thu May 7 12:40:27 GMT 2020


On Thu, 7 May 2020, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Thu, 7 May 2020, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > This implements patterns combining vector element insertion of
> >> > vector element extraction to a VEC_PERM_EXPR of both vectors
> >> > when supported.  Plus it adds the more generic identity transform
> >> > of inserting a piece of itself at the same position.
> >> >
> >> > Richard - is there anything I can do to make this SVE aware?
> >> > I'd need to construct an identity permute and "insert" into
> >> > that permute that element from the other (or same) vector.
> >> > I suppose for most element positions that won't work but
> >> > at least inserting at [0] should?  I'm mostly struggling
> >> > on how to use vec_perm_builder here when nelts is not constant,
> >> > since it's derived from vec<> can I simply start with
> >> > a single pattern with 1 stride and then insert by using []?
> >> 
> >> I guess for SVE we still want to know that the range is safe
> >> for all VL, so after dropping the is_constant check, we'd
> >> want something like:
> >> 
> >>    {
> >>      poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
> >>      unsigned int min_nelts = constant_lower_bound (nelts);
> >>    }
> >>    (if (...
> >>         && at + elemsz <= min_nelts)
> >> 
> >> In theory (hah) it should then just be a case of changing the
> >> vec_perm_builder constructor to:
> >> 
> >>           vec_perm_builder sel (nelts, min_nelts, 3);
> >> 
> >> and then iterating over min_nelts * 3 instead of nelts here:
> >> 
> >> > +       for (unsigned i = 0; i < nelts; ++i)
> >> > +         sel.quick_push (i / elemsz == at
> >> > +			 ? nelts + elem * elemsz + i % elemsz : i);
> >> 
> >> So as far as the encoding goes, the first min_nelts elements are arbitrary
> >> values, and the following two min_nelts sequences form individual linear
> >> series.
> >
> > OK - not sure why we need exactly three nelts per pattern here.
> 
> There are three styles of encoding (see the VECTOR_CST docs in
> generic.texi for the full gory details):
> 
> - replicated {a0,...,an} (1 element per pattern)
> 
> - {a0,...,an} followed by replicated {b0,...,bn} (2 elements per pattern)
> 
> - {a0,...,an} followed by {b0,...,bn,b0+step0,...,bn+stepn,b0+step0*2,...}
>   (3 elements per pattern)
> 
> The min_nelts check ensures that the difference from the identity permute
> selector is all in {a0,...,an}.  The rest of the vector contains the normal
> elements for an identity selector and extends for as long as the runtime
> VL needs it to extend.
> 
> > It also looks like all the constant_multiple_p () checks constrain
> > things quite a bit.
> 
> I don't think it constrains it beyond what we can reasonably do.
> For SVE this is most likely to be useful when converting between
> SVE and Advanced SIMD.
> 
> > Oh, and does a BIT_FIELD_REF with poly-int position
> > extract multiple elements in the end?!  For the case we are extracting
> > a sub-vector and thus elemsz != 1 we constrain it so that this
> > sub-vector is not of variable size (err, not "independently" so,
> > whatever that means..)?
> 
> No, a poly-int position doesn't change how many elements we extract.
> It just defers the calculation of the position until runtime.
> 
> > My brain hurts...  how do you write a GIMPLE testcase for aarch64
> > SVE covering such cases?
> 
> Gimple testcase: with difficulty :-)  I don't think we have a
> gimple FE syntax for poly_ints yet.
> 
> It might be possible to construct a C testcase using intrinsics.
> I'll give it a go...
> 
> >> This ought to work for both SVE and non-SVE, although obviously
> >> there's a bit of wasted work for non-SVE.
> >> 
> >> (And thanks for asking :-))
> >
> > So like this, it seems to still work on the x86 testcases?
> 
> LGTM.  I think the elemsz calculation is going to run into the same
> kind of trouble as PR94980 for AVX/SVE vector booleans, but that
> shouldn't hold the patch up.

I see.  Looks like I have to fix PR88540 first as I see both
gcc.target/i386/pr54855-8.c and gcc.target/i386/pr54855-9.c FAIL
after the patch.  Those are testcases for fancy combiner patterns
involving min/max operations on the [0] lane, leaving the rest of
the vector unmodified.  With the patch we enter PRE with

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
  if (_1 > a_4(D))
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:

  <bb 4> [local count: 1073741824]:
  # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
  x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
  return x_5;

and of course PRE sees that on one path x is unchanged and turns
it into

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
  if (_1 > a_4(D))
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 536870912]:
  _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;

  <bb 4> [local count: 1073741824]:
  # prephitmp_8 = PHI <_7(3), x_3(D)(2)>
  return prephitmp_8;

defeating RTL if-conversion (there are IEEE_MIN/MAX UNSPECs in the
backend).  Without -ffast-math we cannot use MIN/MAX_EXPR here on
GIMPLE (the testcase passes with -ffast-math).  We do have
IFN_FMIN/FMAX for IEEE-compatible fmin/fmax, but the x86 backend has
no expanders for those, and phiopt does not use them without
-ffast-math.

So I'll delay the patch a bit and give that a go (I have halfway
finished phiopt support somewhere lying around).

Richard.
