[Bug tree-optimization/98138] BB vect fail to SLP one case

linkw at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jan 5 08:42:20 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #3)
> 
> IIUC, in the current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently 
> 
> When all these attempts fail, could we retry on the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then start
> from these newly adjusted groups? The built operands should be isomorphic
> then. Maybe too hackish?

This idea seems impractical in the current framework. I was wondering whether
we could use another vectorization approach which is sub-optimal but fits the
current framework better. It focuses only on what happens within one loop
iteration and starts from the grouped stores { tmp[i][0], tmp[i][1],
tmp[i][2], tmp[i][3] }. It could be:

  w1_0123 = load_word(p1)
  V1_0123 = scalar_to_vec(w1_0123) // also with promotion
  w1_4567 = load_word(p1+4)
  V1_4567 = scalar_to_vec(w1_4567)

  w2_0123 = load_word(p2)
  V2_0123 = scalar_to_vec(w2_0123)
  w2_4567 = load_word(p2+4)
  V2_4567 = scalar_to_vec(w2_4567)

  V_a0123 = (V1_0123 - V2_0123) + ((V1_4567 - V2_4567) << 16)

  V1 = V_a0123                  // {a0, a1, a2, a3}
  V2 = vperm (V1, <1, 0, 3, 2>) // {a1, a0, a3, a2}

  V3 = V1 - V2  // {t1, -t1, t3, -t3}
  V4 = V1 + V2  // {t0,  t0, t2,  t2}

  V5 = vperm (V3, V4, <4, 0, 6, 2>) // {t0, t1, t2, t3}
  V6 = vperm (V3, V4, <6, 2, 4, 0>) // {t2, t3, t0, t1}

  V7 = V5 - V6 // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
  V8 = V5 + V6 // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}

  V9 = vperm (V7, V8, <4, 5, 0, 1>) // {t0+t2, t1+t3, t0-t2, t1-t3}

  vec_store (tmp[i], V9)
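
To be concrete, the core permute/add/sub part of this sequence can be written
with GCC's generic vector extensions. This is only a minimal sketch for plain
32-bit lanes, leaving out the load_word/scalar_to_vec promotion step; the
name transform4 is made up:

typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* One 4-element transform over V1 = {a0, a1, a2, a3}, following the
   V1..V9 steps above; returns {t0+t2, t1+t3, t0-t2, t1-t3}.  */
static v4si
transform4 (v4si V1)
{
  v4si V2 = __builtin_shuffle (V1, (v4si) {1, 0, 3, 2});     /* {a1, a0, a3, a2} */
  v4si V3 = V1 - V2;                                         /* {t1, -t1, t3, -t3} */
  v4si V4 = V1 + V2;                                         /* {t0,  t0, t2,  t2} */
  v4si V5 = __builtin_shuffle (V3, V4, (v4si) {4, 0, 6, 2}); /* {t0, t1, t2, t3} */
  v4si V6 = __builtin_shuffle (V3, V4, (v4si) {6, 2, 4, 0}); /* {t2, t3, t0, t1} */
  v4si V7 = V5 - V6;
  v4si V8 = V5 + V6;
  return __builtin_shuffle (V7, V8, (v4si) {4, 5, 0, 1});    /* V9 */
}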

A simplified case is:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2) {
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}
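
With that, foo could be open-coded as below (again just a hypothetical
sketch, reusing the transform4 helper above and using memcpy for the possibly
unaligned vector load; not what the vectorizer emits today):

void foo_vec (int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  v4si v1, v2, r;

  __builtin_memcpy (&v1, p1, sizeof v1);  /* {p1[0], p1[1], p1[2], p1[3]} */
  __builtin_memcpy (&v2, p2, sizeof v2);  /* {p2[0], p2[1], p2[2], p2[3]} */
  r = transform4 (v1 - v2);               /* {a0, a1, a2, a3} in, V9 out */
  __builtin_memcpy (tmp, &r, sizeof tmp);
  test (tmp);
}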

Currently we still can't vectorize even this simplified case. SLP discovery
doesn't go deep enough to expose the grouped-load opportunity at the bottom
nodes loading from p1/p2. It stops early with the nodes { _13, _14, _13, _14 }
and { _15, _16, _15, _16 } because operands like {a0_28, a0_28, a0_28, a0_28}
satisfy the all_uniform_p condition, so we get "Building parent vector
operands from scalars instead".

One rough idea would be:
  1) Relax the all_uniform_p condition somehow so that SLP instance building
goes deeper and builds SLP nodes for those p1/p2 loads.
  2) Introduce one more vect_pattern recognizer to catch this kind of pattern
and transform the SLP instance as we expect. I assume we can see the whole SLP
instance at that point, so we can transform it as we want. This pattern
matching probably needs to be gated by some costing condition.
  3) If 2) fails, trim the SLP instance back at those nodes which satisfy the
all_uniform_p condition, to ensure the result is the same as before.

Does it sound reasonable?

