Bug 88259 - vectorization failure for a typical loop for getting max value and index
Summary: vectorization failure for a typical loop for getting max value and index
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 9.0
: P3 normal
Target Milestone: ---
Assignee: Tamar Christina
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec vectorizer
  Show dependency treegraph
 
Reported: 2018-11-29 08:39 UTC by Jiangning Liu
Modified: 2021-03-11 11:13 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-11-29 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jiangning Liu 2018-11-29 08:39:11 UTC
GCC -O3 can't vectorize the following typical loop for getting max value and index from an array.

void test_vec(int *data, int n) {
        int best_i, best = 0;

        for (int i = 0; i < n; i++) {
                if (data[i] > best) {
                        best = data[i];
                        best_i = i;
                }
        }

        data[best_i] = data[0];
        data[0] = best;
}

The code generated in the kernel loop is as below,

.L4:
        ldr     w4, [x0, x2, lsl 2]
        cmp     w3, w4
        csel    w6, w4, w3, lt
        csel    w5, w2, w5, lt
        add     x2, x2, 1
        mov     w3, w6
        cmp     w1, w2
        bgt     .L4

If n is a constant like 1024, gcc -O3 still fails to vectorize it.

If we only get the max value and keep only one statement in the if statement inside the loop,

void test_vec(int *data, int n) {
        int best = 0;
        for (int i = 0; i < n; i++) {
                if (data[i] > best) {
                        best = data[i];
                }
        }

        data[0] = best;
}

"gcc -O3" can do vectorization and the kernel loop is like below,

.L4:
        ldr     q1, [x2], 16
        smax    v0.4s, v0.4s, v1.4s
        cmp     x2, x3
        bne     .L4
Comment 1 ktkachov 2018-11-29 09:20:24 UTC
Confirmed.
Trying to find just the index (not the max value) vectorises as well:
void test_vec(int *data, int n) {
        int best_i, best = 0;

        for (int i = 0; i < n; i++) {
                if (data[i] > best) {
//                        best = data[i];
                        best_i = i;
                }
        }

        data[best_i] = data[0];
        data[0] = best;
}


-O3:
.L4:
        ldr     q1, [x2], 16
        mov     v3.16b, v2.16b
        add     v2.4s, v2.4s, v4.4s
        cmle    v1.4s, v1.4s, #0
        cmp     x2, x3
        bif     v0.16b, v3.16b, v1.16b
        bne     .L4
        smaxv   s0, v0.4s
        and     w3, w1, -4
        umov    w2, v0.s[0]
        cmn     w2, #1
        csel    w2, w2, wzr, ne
        tst     x1, 3
        beq     .L2
.L3:

But their combination seems like it's throwing the machinery off. I'm guessing the index-finding needs some if-conversion and masking to happen in the vectoriser
Comment 2 Ramana Radhakrishnan 2018-11-29 09:24:48 UTC
(In reply to ktkachov from comment #1)
> Confirmed.
> Trying to find just the index (not the max value) vectorises as well:
> void test_vec(int *data, int n) {
>         int best_i, best = 0;
> 
>         for (int i = 0; i < n; i++) {
>                 if (data[i] > best) {
> //                        best = data[i];
>                         best_i = i;
>                 }
>         }
> 
>         data[best_i] = data[0];
>         data[0] = best;
> }
> 
> 
> -O3:
> .L4:
>         ldr     q1, [x2], 16
>         mov     v3.16b, v2.16b
>         add     v2.4s, v2.4s, v4.4s
>         cmle    v1.4s, v1.4s, #0
>         cmp     x2, x3
>         bif     v0.16b, v3.16b, v1.16b
>         bne     .L4
>         smaxv   s0, v0.4s
>         and     w3, w1, -4
>         umov    w2, v0.s[0]
>         cmn     w2, #1
>         csel    w2, w2, wzr, ne
>         tst     x1, 3
>         beq     .L2
> .L3:
> 
> But their combination seems like it's throwing the machinery off. I'm
> guessing the index-finding needs some if-conversion and masking to happen in
> the vectoriser

ISTR there is some limit in if conversion around the vectorizer where it only works on very simple if-blocks. But this is from memory and it's a bit fuzzy now.
Comment 3 Richard Biener 2018-11-29 10:00:41 UTC
The vectorizer does not like

  <bb 3> [local count: 955630224]:
  # best_i_25 = PHI <best_i_11(8), best_i_16(D)(18)>
  # best_26 = PHI <best_13(8), 0(18)>
  # i_27 = PHI <i_20(8), 0(18)>
  _1 = (long unsigned int) i_27;
  _2 = _1 * 4;
  _3 = data_18(D) + _2;
  _4 = *_3;
  best_i_11 = _4 <= best_26 ? best_i_25 : i_27;
  best_13 = MAX_EXPR <_4, best_26>;
  i_20 = i_27 + 1;
  if (n_17(D) > i_20)

because for the best MAX reduction we have an additional use of the
reduction value in the index reduction.  This combination isn't
magically supported even though in isolation both cases are.

t.c:4:5: note:   Analyze phi: best_26 = PHI <best_13(8), 0(18)>
t.c:4:5: missed:   reduction used in loop.
t.c:4:5: missed:   Unknown def-use cycle pattern.
t.c:4:5: note:   Analyze phi: best_i_25 = PHI <best_i_11(8), best_i_16(D)(18)>
t.c:4:5: note:   detected reduction: need to swap operands: best_i_11 = _4 > best_26 ? i_27 : best_i_25;
t.c:4:5: note:   Detected reduction.

if we'd been lucky and had analyzed best_i_25 before best_26 then we could
probably special-case the case of "reduction used in loop" when that appears
in other reductions.  In general that's of course still not valid I think.

Alternatively the reduction operation could be combined somehow via
pattern detection.
Comment 4 rsandifo@gcc.gnu.org 2018-12-07 15:56:56 UTC
(In reply to Richard Biener from comment #3)
> The vectorizer does not like
> 
>   <bb 3> [local count: 955630224]:
>   # best_i_25 = PHI <best_i_11(8), best_i_16(D)(18)>
>   # best_26 = PHI <best_13(8), 0(18)>
>   # i_27 = PHI <i_20(8), 0(18)>
>   _1 = (long unsigned int) i_27;
>   _2 = _1 * 4;
>   _3 = data_18(D) + _2;
>   _4 = *_3;
>   best_i_11 = _4 <= best_26 ? best_i_25 : i_27;
>   best_13 = MAX_EXPR <_4, best_26>;
>   i_20 = i_27 + 1;
>   if (n_17(D) > i_20)
> 
> because for the best MAX reduction we have an additional use of the
> reduction value in the index reduction.  This combination isn't
> magically supported even though in isolation both cases are.
> 
> t.c:4:5: note:   Analyze phi: best_26 = PHI <best_13(8), 0(18)>
> t.c:4:5: missed:   reduction used in loop.
> t.c:4:5: missed:   Unknown def-use cycle pattern.
> t.c:4:5: note:   Analyze phi: best_i_25 = PHI <best_i_11(8),
> best_i_16(D)(18)>
> t.c:4:5: note:   detected reduction: need to swap operands: best_i_11 = _4 >
> best_26 ? i_27 : best_i_25;
> t.c:4:5: note:   Detected reduction.
> 
> if we'd been lucky and had analyzed best_i_25 before best_26 then we could
> probably special-case the case of "reduction used in loop" when that appears
> in other reductions.  In general that's of course still not valid I think.
Yeah.  Disabling the check for uses in the loop:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the reduction
     value.  */
  if ((!nested_in_vect_loop || inner_loop_of_double_reduc)
      && (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1))

gives us something like the vector body we want, modulo some
inefficiency:

.L4:
        ldr     q4, [x2], 16
        mov     v3.16b, v2.16b
        add     v2.4s, v2.4s, v6.4s
        cmge    v5.4s, v0.4s, v4.4s
        cmp     x3, x2
        smax    v0.4s, v0.4s, v4.4s
        bif     v1.16b, v3.16b, v5.16b
        bne     .L4

where v0.4s ends up containing the maximum for each individual
lane and v1.s contains the best_i associated with each member
of v0.4s.  We "just" then need to make the epilogue do the
right thing with this information.

Hacking out the condition above (obviously an invalid thing
to do) sets "best" to the maximum of v0.s (good) but also sets
"best_i" to the maximum of v1.s (bad).  We need to restrict the
maximum of v1.s to lanes of v0.s that contain "best" (i.e. the
reduction result of v0.s):

        dup    v2.4s, best
        cmpeq  v2.4s, v2.4s, v0.4s
        and    v1.4s, v1.4s, v2.4s

and only then take the maximum of v1.4s.

This requires "best" to come from a reassociatve conditional
reduction and would require the "best_i" reduction to be marked
as dependent on the "best" reduction.  Might end up being a bit
messy, since we'd have to be careful to retain the uses check
above for all other cases.
Comment 5 Tamar Christina 2019-04-09 18:13:02 UTC
I'll be taking a look at this one as a part of GCC 10 as well.
Comment 6 rguenther@suse.de 2019-04-10 08:08:28 UTC
On Tue, 9 Apr 2019, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259
> 
> Tamar Christina <tnfchris at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |ASSIGNED
>                  CC|                            |tnfchris at gcc dot gnu.org
>            Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org
> 
> --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> I'll be taking a look at this one as a part of GCC 10 as well.

Note that ripping out non-SLP support from the vectorizer will turn
reduction support upside down ... which means the work will heavily
conflict, either me or you needing to re-do stuff.
Comment 7 Tamar Christina 2019-04-10 09:19:38 UTC
> Note that ripping out non-SLP support from the vectorizer will turn
> reduction support upside down ... which means the work will heavily
> conflict, either me or you needing to re-do stuff.

Fair enough, thanks for the heads up!