Bug 106343 - SLP does not support no-op case
Summary: SLP does not support no-op case
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 112579 (view as bug list)
Depends on:
Blocks: vectorizer 112325
  Show dependency treegraph
 
Reported: 2022-07-18 15:51 UTC by Manolis Tsamis
Modified: 2023-11-17 06:21 UTC (History)
7 users (show)

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-07-18 00:00:00


Attachments
Does not vectorize (269 bytes, text/plain)
2022-07-18 15:51 UTC, Manolis Tsamis
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Manolis Tsamis 2022-07-18 15:51:09 UTC
Created attachment 53316 [details]
Does not vectorize

The following test case:

  void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
  {
    uint16_t diff_e0 = src1[0] - src2[0];
    uint16_t diff_e1 = src1[1] - src2[1];
    uint16_t diff_e2 = src1[2] - src2[2];
    uint16_t diff_e3 = src1[3] - src2[3];
    uint16_t diff_e4 = src1[4] - src2[4];
    uint16_t diff_e5 = src1[5] - src2[5];
    uint16_t diff_e6 = src1[6] - src2[6];
    uint16_t diff_e7 = src1[7] - src2[7];

    uint32_t a0 = diff_e0 + 1;
    uint32_t a1 = diff_e1 + 3;
    uint32_t a2 = diff_e2 + 4;
    uint32_t a3 = diff_e3 + 2;
    uint32_t a4 = diff_e4 + 12;
    uint32_t a5 = diff_e5 + 11;
    uint32_t a6 = diff_e6 + 9;
    uint32_t a7 = diff_e7 + 3;

    dst[0] = a0;
    dst[1] = a1;
    dst[2] = a2;
    dst[3] = a3;
    dst[4] = a4;
    dst[5] = a5;
    dst[6] = a6;
    dst[7] = a7;
  }

Produces nice vectorized code on aarch64:

  ldr     d2, [x2]
  adrp    x3, .LC0
  ldr     d0, [x1]
  ldr     q1, [x3, #:lo12:.LC0]
  usubl   v0.8h, v0.8b, v2.8b
  uaddl   v2.4s, v0.4h, v1.4h
  uaddl2  v0.4s, v0.8h, v1.8h
  stp     q2, q0, [x0]
  ret

But if any of the constants is replaced with zero instead then scalar code is produced:

  ldrb    w4, [x2, 1]
  ldrb    w8, [x1, 1]
  ldrb    w3, [x2, 3]
  ldrb    w7, [x1, 3]
  sub     w8, w8, w4
  ldrb    w6, [x1, 4]
  and     w8, w8, 65535
  ldrb    w4, [x2, 4]
  sub     w7, w7, w3
  ldrb    w5, [x1, 5]
  and     w7, w7, 65535
  ldrb    w3, [x2, 5]
  sub     w6, w6, w4
  ldrb    w9, [x2, 6]
  and     w6, w6, 65535
  ldrb    w4, [x1, 6]
  sub     w5, w5, w3
  ldrb    w10, [x2, 7]
  and     w5, w5, 65535
  ldrb    w3, [x1, 7]
  sub     w4, w4, w9
  ldrb    w11, [x2]
  and     w4, w4, 65535
  ldrb    w9, [x1]
  sub     w3, w3, w10
  ldrb    w2, [x2, 2]
  add     w8, w8, 3
  ldrb    w10, [x1, 2]
  sub     w9, w9, w11
  and     w1, w3, 65535
  and     w9, w9, 65535
  sub     w10, w10, w2
  add     w3, w5, 11
  add     w2, w4, 9
  add     w7, w7, 2
  add     w6, w6, 12
  add     w1, w1, 3
  add     w4, w9, 1
  and     w5, w10, 65535
  stp     w4, w8, [x0]
  stp     w5, w7, [x0, 8]
  stp     w6, w3, [x0, 16]
  stp     w2, w1, [x0, 24]
  ret

It would be possible to produce the same vectorized code as above but with zero in the constants. If I understand correctly, the identity element of addition is not taken into consideration in the SLP vectorizer, which could be improved. The same happens with subtraction.

I can reproduce this in any recent version of GCC (e.g. >= 10).

Vectorized case: https://godbolt.org/z/5sbb1an89
Scalar case:     https://godbolt.org/z/v8jPT9jEe
Comment 1 ktkachov 2022-07-18 16:07:50 UTC
Confirmed, it's quite odd. x86_64 is also affected:
https://godbolt.org/z/q46z3hh9Y
Comment 2 Andrew Pinski 2022-07-18 16:12:19 UTC
Also an issue with multiply:

void foo (unsigned *__restrict dst, unsigned *__restrict src1)
{
    dst[0] = src1[0] * 1;
    dst[1] = src1[1] * 2;
    dst[2] = src1[2] * 3;
    dst[3] = src1[3] * 4;
    dst[4] = src1[4] * 5;
    dst[5] = src1[5] * 6;
    dst[6] = src1[6] * 7;
    dst[7] = src1[7] * 8;
}
Comment 3 Andrew Pinski 2022-07-18 16:20:04 UTC
I should note that I noticed LLVM does not handle this either.

Basically the following operators and values can be used:
For integer:
+ 0
- 0
* 1
/ 1
| 0
& -1 (all ones)
^ 0

For floating point (only with -ffast-math, I think sub can be used with 0 and add with -0.0 without but I am not 100% sure):
+ 0
- 0
* 1
/ 1
Comment 4 Andrew Pinski 2022-07-18 16:24:28 UTC
(In reply to Andrew Pinski from comment #3)
> I should note that I noticed LLVM does not handle this either.
> 
> Basically the following operators and values can be used:
> For integer:
> + 0
> - 0
> * 1
> / 1
> | 0
> & -1 (all ones)
> ^ 0
> 
> For floating point (only with -ffast-math, I think sub can be used with 0
> and add with -0.0 without but I am not 100% sure):
> + 0
> - 0
> * 1
> / 1

Note for the following operators can support some constants which were there instead of a calculation (note this might be harder and maybe a different bug):
op cst   rhs
*   0     0
|  -1    -1
&   0     0
Comment 5 Erick Ochoa 2022-07-18 17:42:03 UTC
(In reply to Andrew Pinski from comment #3)
> I should note that I noticed LLVM does not handle this either.
> 
> Basically the following operators and values can be used:
> For integer:
> + 0
> - 0
> * 1
> / 1
> | 0
> & -1 (all ones)
> ^ 0
> 
> For floating point (only with -ffast-math, I think sub can be used with 0
> and add with -0.0 without but I am not 100% sure):
> + 0
> - 0
> * 1
> / 1

I think it should be possible to also consider the bit-shifts operations:

>> 0
<< 0
Comment 6 Andrew Pinski 2022-07-18 17:48:09 UTC
(In reply to Erick Ochoa from comment #5)
> I think it should be possible to also consider the bit-shifts operations:
> 
> >> 0
> << 0

Yes and rotates too.
Comment 7 Richard Biener 2022-07-19 06:44:03 UTC
SLP discovery doesn't support this (and there's for sure some duplicate bug about this).  Note that SLP discovery currently does a greedy search from the stores and it commits to the first "working" graph (where "working" differs from loop vs. non-loop operation), opening up more "fixup" possibilities will also open up the chance for it to de-rail more easily.
Comment 8 Andrew Pinski 2023-11-17 06:19:48 UTC
*** Bug 112579 has been marked as a duplicate of this bug. ***