Test case:

extern void test(unsigned int t[4][4]);

void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
    a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
    a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
    a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

    int t0 = a0 + a1;
    int t1 = a0 - a1;
    int t2 = a2 + a3;
    int t3 = a2 - a3;

    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
  test(tmp);
}

The expected code on ppc64le can look like:

// p1 byte 0 to byte 7
d1_0_7 = load_dword(p1)
// p1+i1 byte 0 to byte 7, renamed as bytes 8 to 15
d1_8_15 = load_dword(p1 + i1)
d1_16_23 = load_dword(p1 + 2*i1)
d1_24_31 = load_dword(p1 + 3*i1)

V_d1_0_15 = construct_vec(d1_0_7, d1_8_15)      // vector char
V_d1_16_31 = construct_vec(d1_16_23, d1_24_31)
V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
                     {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
                     {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

// Do the same for p2 with i2, getting V_d2_0_3_all, V_d2_4_7_all

// Do the subtraction together (all 4x4 bytes)
V_sub1 = V_d1_0_3_all - V_d2_0_3_all
V_sub2 = V_d1_4_7_all - V_d2_4_7_all

// Do some unpacking to get the promoted vector int
V_a0_tmp = vec_promote(V_sub2, {0 1 2 3})       // vector int {b4 b12 b20 b28}
V_a0_1 = V_a0_tmp << 16
V_a0_0 = vec_promote(V_sub1, {0 1 2 3})         // vector int {b0 b8 b16 b24}
// vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
V_a0 = V_a0_0 + V_a0_1

// Similarly get V_a1, V_a2, V_a3

// Compute t0/t1/t2/t3
// vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
V_t0 = V_a0 + V_a1
V_t1 = V_a0 - V_a1
V_t2 = V_a2 + V_a3
V_t3 = V_a2 - V_a3

// Compute the tmp columns
// vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
V_tmp0 = V_t0 + V_t2
V_tmp2 = V_t0 - V_t2
V_tmp1 = V_t1 + V_t3
V_tmp3 = V_t1 - V_t3

// Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
// with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
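For reference, the column-wise scheme above can also be written directly in C with GCC's generic vector extensions. This is only a rough sketch: the name foo_colwise, the vec_u32 typedef, the scalar byte-gathering loops and the final scalar transpose are illustrative stand-ins for the vector loads and vperms of the pseudocode, not what the vectorizer would actually emit.

typedef unsigned int vec_u32 __attribute__((vector_size(16)));

extern void test(unsigned int t[4][4]);

void foo_colwise(unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  vec_u32 a[4];                    /* a[j] = {a_j of iter 0, ..., a_j of iter 3} */

  /* Gather column j across the four iterations (vector loads + vperms above).  */
  for (int j = 0; j < 4; j++)
    for (int i = 0; i < 4; i++)
      {
        unsigned char *q1 = p1 + i * i1, *q2 = p2 + i * i2;
        a[j][i] = (q1[j] - q2[j]) + ((q1[j + 4] - q2[j + 4]) << 16);
      }

  /* One Hadamard stage over whole columns, i.e. all four iterations at once.  */
  vec_u32 t0 = a[0] + a[1], t1 = a[0] - a[1];
  vec_u32 t2 = a[2] + a[3], t3 = a[2] - a[3];
  vec_u32 c0 = t0 + t2, c2 = t0 - t2;          /* c_k = column k of tmp */
  vec_u32 c1 = t1 + t3, c3 = t1 - t3;

  /* Transpose the column vectors back into the rows of tmp.  */
  for (int i = 0; i < 4; i++)
    {
      tmp[i][0] = c0[i];
      tmp[i][1] = c1[i];
      tmp[i][2] = c2[i];
      tmp[i][3] = c3[i];
    }
  test(tmp);
}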
A similar case is x264_pixel_satd_8x4 in x264: https://github.com/mirror/x264/blob/4121277b40a667665d4eea1726aefdc55d12d110/common/pixel.c#L288
So the expected vectorization builds vectors

  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }

That's not SLP; SLP tries to build the

  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }

vector and "succeeds" - the SLP tree turns out to be highly inefficient, though. So for the stores your desire is to see an interleaving scheme with VF 4 (the number of iterations). But interleaving fails because it would require a VF of 16 and there are not enough iterations in the loop.

The classical SLP scheme degenerates (also due to the mixed plus/minus ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3} expression.

Starting SLP discovery from the grouped loads would get things going up to the same expression.

So I'm not sure what the best approach to this case is. The testcase can be simplified while still showing the SLP discovery issue:

extern void test(unsigned int t[4][4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    a0 = (p1[0] - p2[0]);
    a1 = (p1[1] - p2[1]);
    a2 = (p1[2] - p2[2]);
    a3 = (p1[3] - p2[3]);

    int t0 = a0 + a1;
    int t1 = a0 - a1;
    int t2 = a2 + a3;
    int t3 = a2 - a3;

    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
  test(tmp);
}

So it's basically SLP discovery degenerating to an interleaving scheme on the load side but not actually "implementing" it.
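Spelled out for the simplified testcase (roughly, with lanes written in store order), the row-grouped discovery looks like:

  store group { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
            = { t0 + t2,   t1 + t3,   t0 - t2,   t1 - t3 }      (mixed +/- node)
    operand A { t0, t1, t0, t1 } = { a0 + a1, a0 - a1, a0 + a1, a0 - a1 }
    operand B { t2, t3, t2, t3 } = { a2 + a3, a2 - a3, a2 + a3, a2 - a3 }
      operands of A: { a0, a0, a0, a0 } and { a1, a1, a1, a1 }  (uniform)
      operands of B: { a2, a2, a2, a2 } and { a3, a3, a3, a3 }  (uniform)

so one level below the a{0,2} {+,-} a{1,3} expressions everything becomes uniform vectors and discovery never reaches the p1/p2 loads.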
(In reply to Richard Biener from comment #2)
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in the current implementation, we get four grouped stores:
  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }  /i=0,1,2,3/ independently.

When all these attempts fail, could we retry with the groups
  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] }  /i=0,1,2,3/
with one extra intermediate layer covering those original groups, then start from these newly adjusted groups? The built operands should be isomorphic then. Maybe too hackish?
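To illustrate the last point: with the transposed groups every lane comes from the same position of a different iteration, so each operand vector holds four distinct but isomorphic computations instead of a uniform splat, all the way down to strided loads from p1/p2:

  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
    = { t0+t2 of iter 0, t0+t2 of iter 1, t0+t2 of iter 2, t0+t2 of iter 3 }
      operand A: { t0 of iter 0..3 }   operand B: { t2 of iter 0..3 }
        operands of A: { a0 of iter 0..3 } and { a1 of iter 0..3 }
          ... down to { p1[0] of iter 0..3 }, { p2[0] of iter 0..3 }, etc.
              (strided by i1/i2)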
(In reply to Kewen Lin from comment #3)
> IIUC, in the current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }  /i=0,1,2,3/ independently.
>
> When all these attempts fail, could we retry with the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] }  /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then
> start from these newly adjusted groups? The built operands should be
> isomorphic then. Maybe too hackish?

This thought seems impractical in the current framework. I was thinking whether we can use another vectorization approach which is sub-optimal but fits the current framework better. It only focuses on things within one loop iteration and starts from the grouped stores { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }. It can be:

w1_0123 = load_word(p1)
V1_0123 = scalar_to_vec(w1_0123)   // also with promotion
w1_4567 = load_word(p1+4)
V1_4567 = scalar_to_vec(w1_4567)
w2_0123 = load_word(p2)
V2_0123 = scalar_to_vec(w2_0123)
w2_4567 = load_word(p2+4)
V2_4567 = scalar_to_vec(w2_4567)

V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567) << 16

V1 = V_a0123                       // {a0, a1, a2, a3}
V2 = vperm (V1, <1, 0, 3, 2>)      // {a1, a0, a3, a2}
V3 = V1 - V2                       // {t1, -t1, t3, -t3}
V4 = V1 + V2                       // {t0, t0, t2, t2}
V5 = vperm (V3, V4, <4, 0, 6, 2>)  // {t0, t1, t2, t3}
V6 = vperm (V3, V4, <6, 2, 4, 0>)  // {t2, t3, t0, t1}
V7 = V5 - V6                       // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
V8 = V5 + V6                       // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}
V9 = vperm (V7, V8, <4, 5, 0, 1>)
vec_store (tmp[i], V9)

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize this simple case. SLP doesn't go deep enough to expose the grouped-load opportunity on the bottom nodes loading p1/p2. It ends up early with the nodes { _13, _14, _13, _14 } and { _15, _16, _15, _16 } because operands like { a0_28, a0_28, a0_28, a0_28 } etc. satisfy the all_uniform_p condition, so "Building parent vector operands from scalars instead".

One rough idea seems to be:
1) Relax the all_uniform_p condition somehow to get SLP instance building to go deeper and get those p1/p2 loads as SLP nodes.
2) Introduce one more vect_pattern recognizer to catch this kind of pattern and transform the SLP instance as we expect. I assume we can know the whole SLP instance, then we can transform it as we want here. We probably need some costing condition to gate this pattern matching.
3) If 2) fails, trim the SLP instance from those nodes which satisfy the all_uniform_p condition to ensure it's the same as before.

Does it sound reasonable?
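A rough C sketch of the above scheme for the simplified case, using GCC's generic vector extensions; the names foo_vec and vec_u32 are made up, __builtin_shuffle stands in for the vperms, and the memcpy loads/stores are only there to keep the sketch alignment-agnostic:

typedef unsigned int vec_u32 __attribute__((vector_size(16)));

extern void test(unsigned int t[4]);

void foo_vec(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  vec_u32 V1, V2, V3, V4, V5, V6, V7, V8, V9;

  __builtin_memcpy(&V1, p1, 16);                          /* {p1[0..3]} */
  __builtin_memcpy(&V2, p2, 16);                          /* {p2[0..3]} */
  V1 = V1 - V2;                                           /* {a0, a1, a2, a3} */

  V2 = __builtin_shuffle(V1, (vec_u32){1, 0, 3, 2});      /* {a1, a0, a3, a2} */
  V3 = V1 - V2;                                           /* {t1, -t1, t3, -t3} */
  V4 = V1 + V2;                                           /* {t0, t0, t2, t2} */
  V5 = __builtin_shuffle(V3, V4, (vec_u32){4, 0, 6, 2});  /* {t0, t1, t2, t3} */
  V6 = __builtin_shuffle(V3, V4, (vec_u32){6, 2, 4, 0});  /* {t2, t3, t0, t1} */
  V7 = V5 - V6;                                           /* {t0-t2, t1-t3, ...} */
  V8 = V5 + V6;                                           /* {t0+t2, t1+t3, ...} */
  V9 = __builtin_shuffle(V7, V8, (vec_u32){4, 5, 0, 1});  /* {tmp[0..3]} */
  __builtin_memcpy(tmp, &V9, 16);

  test(tmp);
}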
(In reply to Kewen Lin from comment #4)
> One rough idea seems to be:
> 1) Relax the all_uniform_p condition somehow to get SLP instance building
> to go deeper and get those p1/p2 loads as SLP nodes.
> 2) Introduce one more vect_pattern recognizer to catch this kind of
> pattern and transform the SLP instance as we expect. I assume we can know
> the whole SLP instance, then we can transform it as we want here. We
> probably need some costing condition to gate this pattern matching.
> 3) If 2) fails, trim the SLP instance from those nodes which satisfy the
> all_uniform_p condition to ensure it's the same as before.

For 2), instead of a vect_pattern with an IFN, the appropriate place seems to be vect_optimize_slp. But after more thinking, building the SLP instance starting from the grouped loads instead of the grouped stores looks more straightforward.

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

Building the vector <a0, a1, a2, a3> looks more natural; then check the uses of all its lanes and the special patterns to get the vector <t0, t1, t2, t3>, and repeat similarly.

Hi Richi, is this a good example for requesting SLP instance building to start from grouped loads?
Starting from the loads is not how SLP discovery works, so there will be zero re-use of code. Sure - the only important thing is that you end up with a valid SLP graph.

But going back to the original testcase and the proposed vectorization for power - is that faster in the end?

For the "rewrite" of the vectorizer into all-SLP we do have to address that "interleaving scheme not carried out as interleaving" at some point, but that's usually for loop vectorization - for BB vectorization all we have is optimize_slp. I have patches that would build the vector load SLP node (you still have to kill that 'build from scalars' thing to make it trigger). But then we end up with a shared vector load node and N extract/splat operations at the 'scalar' points. It's not entirely clear to me how to re-arrange the SLP graph at that point.

Btw, on current trunk the simplified testcase no longer runs into the 'scalar operand' build case, but of course vectorization is thought to be not profitable. Pattern recog of the plus/minus subgraphs may help (not sure if ppc has those as instructions, x86 has).

That said, "failure" to identify the common (vector) load is known and I do have experimental patches trying to address that but did not yet arrive at a conclusive "best" approach.
(In reply to Richard Biener from comment #6)
> Starting from the loads is not how SLP discovery works, so there will be
> zero re-use of code. Sure - the only important thing is that you end up
> with a valid SLP graph.
>
> But going back to the original testcase and the proposed vectorization
> for power - is that faster in the end?

Good question. I coded foo2 and foo3 with altivec built-in functions as the proposals; foo2 is for the original version (#c0) and foo3 is for the sub-optimal version (#c4). They have been verified with some random inputs to ensure functionality. I evaluated the timings on Power8/Power9/Power10 (pre) with -Ofast and the corresponding -mcpu=power{8,9,10} for the build. Using foo1 (the scalar version) as the baseline of 100 and normalizing, the speed-ups are below (bigger is better):

        foo1    foo2    foo3
  P8    100     163.26  164.97
  P9    100     101.15  101.60
  P10   100     172.42  159.10

The evaluation shows the vectorized versions can have big gains on Power8 and Power10, but the gain is trivial on Power9. I tried to evaluate the P8 executable on P9; it didn't show any remarkable difference. There are some known issues on P9 for fp/vect-intensive workloads; it's probably due to that, since the vectorized versions are full of vector code. Sigh, we might have to use different costs for P9, but that's another story. Btw, I tried to run the P9 executable on P10, and it also gains as much as what we see in the P10 row. So it's profitable to vectorize this, as the P8 and P10 testing shows.

> For the "rewrite" of the vectorizer into all-SLP we do have to address
> that "interleaving scheme not carried out as interleaving" at some point,
> but that's usually for loop vectorization - for BB vectorization all
> we have is optimize_slp. I have patches that would build the vector load
> SLP node (you still have to kill that 'build from scalars' thing to make
> it trigger). But then we end up with a shared vector load node and
> N extract/splat operations at the 'scalar' points. It's not entirely
> clear to me how to re-arrange the SLP graph at that point.

Great, it's good to know we can end up with a shared vector load node. It looks like we can further simplify the graph on top of it. For the #c4 simplified case, assuming we have two shared vector load nodes:

  N1 = <p1[0], p1[1], p1[2], p1[3]>
  N2 = <p2[0], p2[1], p2[2], p2[3]>

1) If all lanes of one shared vector load node only work with all lanes of another shared vector load node, we can move up (postpone) the extract/splat operations. For this case, we need 4 extract/splats for p1's lanes, 4 extract/splats for p2's lanes and 4 minuses for {T,T,T,T} /T=a{0,1,2,3}/; this can be simplified to one shared node N3 = N1 - N2 (1 minus) and 4 extract/splats.

2) Then we have the shared vector node <a0,a1,a2,a3>. Before it's extracted/splatted into {T,T,T,T} /T=a{0,1,2,3}/, we can check all the uses of the extracted/splatted lanes, try to detect some pattern and simplify further. For this case, it can be permuted to <a1,a0,a3,a2>, then the minus/plus and the permutations give the nodes N4 <_13, _14, _13, _14> and N5 <_15, _16, _15, _16>, saving the need for extract/splat.

3) Further, we can detect the pattern on the SLP graph and skip generating N4 and N5 altogether, instead generating <_13, _14, _15, _16> and repeating the minus/plus and permutations for the required t0/t1/t2/t3 computation.

> Btw, on current trunk the simplified testcase no longer runs into the
> 'scalar operand' build case, but of course vectorization is thought to be
> not profitable. Pattern recog of the plus/minus subgraphs may help
> (not sure if ppc has those as instructions, x86 has).

You meant the simplified case in #c4? I rebased onto the latest trunk (r11-6601) but still ran into the 'scalar operand' build; unfortunately ppc doesn't have that kind of instruction, sigh.

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

Thanks for the information!
Created attachment 49942 [details] vectorized with altivec built-in functions
The full satd_8x4 looks like the following; the 2nd loop isn't to be disregarded:

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
Note that current clang does a pretty decent job on this now on aarch64 (in case it gives some inspiration on the approach) https://godbolt.org/z/EPvqMhh7v
> The full satd_8x4 looks like the following; the 2nd loop isn't to be
> disregarded

Regarding the second loop, this patch
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608827.html
should result in improved vectorization and performance. Currently ((a>>15)&0x10001)*0xffff from abs2 is calculated using 4 vector operations (shift, bitand, shift+sub for the multiplication), whereas with the proposed patch this will be a single vector compare operation.
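To illustrate the operation-count argument (not the actual patch): if each 32-bit SWAR value is viewed as two 16-bit lanes, the multiply-based mask in abs2 is just a per-lane "is negative" mask, which one vector compare produces directly. A sketch with GCC generic vector extensions, where vec_u16/vec_s16 and the function names are made up; it only shows the mask computation itself:

typedef unsigned short vec_u16 __attribute__((vector_size(16)));
typedef short vec_s16 __attribute__((vector_size(16)));

/* What the current code does per lane: shift, and, multiply
   (the multiply itself being expanded as shift+sub).  */
static inline vec_u16 sign_mask_mul(vec_u16 a)
{
  return ((a >> 15) & 1) * 0xffff;
}

/* The same mask with one compare: all-ones where the lane is negative.  */
static inline vec_u16 sign_mask_cmp(vec_u16 a)
{
  vec_s16 zero = { 0 };
  return (vec_u16)((vec_s16)a < zero);
}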
Hi Richi,

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

It was a long time ago, so do you have the "best" approach now?

Thanks,
-Jiangning
(In reply to Jiangning Liu from comment #12)
> Hi Richi,
>
> > That said, "failure" to identify the common (vector) load is known
> > and I do have experimental patches trying to address that but did
> > not yet arrive at a conclusive "best" approach.
>
> It was a long time ago, so do you have the "best" approach now?

I do have ideas, but how it will play out is unknown. Also, the current priority is to improve maintainability of the code by getting rid of the paths in the vectorizer that use the non-SLP data structures. Next would be to redo loop SLP discovery from scratch, which would include possibly fixing this issue.

> Thanks,
> -Jiangning
Btw, previous work is at refs/users/rguenth/heads/load-perm