[Bug rtl-optimization/56766] Fails to combine (vec_select (vec_concat ...)) to (vec_merge ...)

Tue Jun 16 10:43:00 GMT 2015

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56766

--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 12 Jun 2015, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56766
> 
> --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Uroš Bizjak from comment #23)
> 
> > Testcases, compile with "-O2 -ftree-vectorize -mavx".
> 
> Richi, please note that tree-vectorizer doesn't vectorize bar_v2df, at least
> there is no VEC_PERM_EXPR in the .optimized dump:
> 
> void bar_v2df (double * __restrict__ p, double * __restrict q)
> {
>   p[0] = p[0] - q[0];
>   p[1] = p[1] + q[1];
> }

That's because of the cost model (unless you specify -fno-vect-cost-model):

t.c:3:11: note: Cost model analysis:
  Vector inside of basic block cost: 9
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 8
t.c:3:11: note: not vectorized: vectorization is not profitable.

so it computes too high a vectorized cost.  This is because the
target-independent code handling this estimates the cost as
needing the add, the subtract and the shuffle.  The
target vectorizer cost hook could adjust this to a more sensible
value if addsubpd is available.

> Another question w.r.t. foo_* testcases that use __builtin_shuffle:
> 
> v4sf foo_v4sf (v4sf x, v4sf y)
> {
>   v4sf tem0 = x - y;
>   v4sf tem1 = x + y;
>   return __builtin_shuffle (tem0, tem1, (v4si) { 0, 5, 2, 7 });
> }
> 
> is functionally equivalent to:
> 
> v4sf foo_v4sf (v4sf x, v4sf y)
> {
>   v4sf tem0 = x + y;
>   v4sf tem1 = x - y;
>   return __builtin_shuffle (tem0, tem1, (v4si) { 4, 1, 6, 3 });
> }
> 
> But the latter construct isn't simplified. Should we declare canonical form as
> the one with "element 0 from the first operand"?

That one is interesting.  I'd say we'd need to define a total ordering
here.  Note that a canonical form is only accepted when the target accepts
it (see the VEC_PERM_EXPR case in fold-const.c).
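
To see that the two quoted forms really select the same elements, here is
a minimal scalar sketch.  It does not use the vector extensions;
emulate_shuffle, addsub_form1 and addsub_form2 are hypothetical names made
up for illustration, and the emulation assumes the usual selector
convention (indices 0..n-1 pick from the first operand, n..2n-1 from the
second).

```c
#include <assert.h>

/* Hypothetical scalar emulation of a two-operand permute.
   sel[i] in 0..n-1 selects a[sel[i]], sel[i] in n..2n-1 selects
   b[sel[i] - n].  */
static void
emulate_shuffle (const float *a, const float *b,
                 const unsigned char *sel, float *out, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    {
      assert (sel[i] < 2 * n);
      out[i] = sel[i] < n ? a[sel[i]] : b[sel[i] - n];
    }
}

/* First quoted form: tem0 = x - y, tem1 = x + y, mask { 0, 5, 2, 7 }.  */
static void
addsub_form1 (const float *x, const float *y, float *out)
{
  static const unsigned char sel[4] = { 0, 5, 2, 7 };
  float tem0[4], tem1[4];
  for (unsigned i = 0; i < 4; i++)
    {
      tem0[i] = x[i] - y[i];
      tem1[i] = x[i] + y[i];
    }
  emulate_shuffle (tem0, tem1, sel, out, 4);
}

/* Second quoted form: operands swapped, mask { 4, 1, 6, 3 }.  */
static void
addsub_form2 (const float *x, const float *y, float *out)
{
  static const unsigned char sel[4] = { 4, 1, 6, 3 };
  float tem0[4], tem1[4];
  for (unsigned i = 0; i < 4; i++)
    {
      tem0[i] = x[i] + y[i];
      tem1[i] = x[i] - y[i];
    }
  emulate_shuffle (tem0, tem1, sel, out, 4);
}
```

Both functions produce x[i] - y[i] in the even lanes and x[i] + y[i] in
the odd lanes, so the only difference is which operand order and mask the
source happens to use.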

So, if we can write a function compare_perm_for_canonical (unsigned char 
*sel1, unsigned char *sel2, unsigned n) we could use that to determine
if swapping arg0 and arg1 makes the permute mask more canonical.

So yes, we should have a canonical form for the above and yes, we
could say that we order by element 0 and, if that is equal, by
element 1, and so on.
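
A minimal sketch of what that could look like, assuming a plain
lexicographic order on the selector elements.  Only the name and the
argument list of compare_perm_for_canonical come from the text above (I
added const); swap_perm_operands, swap_makes_more_canonical and
MAX_PERM_ELTS are hypothetical helpers for illustration, not existing GCC
code, and the target check mentioned above is left out entirely.

```c
#include <assert.h>

/* Hypothetical upper bound on the number of selector elements.  */
#define MAX_PERM_ELTS 64

/* Lexicographic comparison of two selectors of length n.
   Returns <0 if sel1 orders before sel2 (is "more canonical"),
   >0 if it orders after, and 0 if they are identical.  */
static int
compare_perm_for_canonical (const unsigned char *sel1,
                            const unsigned char *sel2, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    if (sel1[i] != sel2[i])
      return sel1[i] < sel2[i] ? -1 : 1;
  return 0;
}

/* Hypothetical helper: compute the selector that yields the same result
   once arg0 and arg1 are swapped.  Indices 0..n-1 refer to the first
   operand, n..2n-1 to the second, so each index moves by n.  */
static void
swap_perm_operands (const unsigned char *sel, unsigned char *out,
                    unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    {
      assert (sel[i] < 2 * n);
      out[i] = sel[i] < n ? sel[i] + n : sel[i] - n;
    }
}

/* Hypothetical helper: return 1 if swapping arg0 and arg1 gives a
   selector that orders before the current one, else 0.  */
static int
swap_makes_more_canonical (const unsigned char *sel, unsigned n)
{
  unsigned char swapped[MAX_PERM_ELTS];
  assert (n <= MAX_PERM_ELTS);
  swap_perm_operands (sel, swapped, n);
  return compare_perm_for_canonical (swapped, sel, n) < 0;
}
```

For the masks above, { 4, 1, 6, 3 } swaps to { 0, 5, 2, 7 }, which orders
first, so the second form would be rewritten into the first.  With this
particular ordering the first element already decides between a selector
and its swapped form, because swapping changes every index by n; that
matches the "element 0 from the first operand" rule.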