Bug 86504 - vectorization failure for a nest loop
Summary: vectorization failure for a nest loop
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 9.0
: P3 normal
Target Milestone: 10.0
Assignee: Tamar Christina
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2018-07-12 07:35 UTC by Jiangning Liu
Modified: 2019-11-21 16:02 UTC (History)
4 users (show)

See Also:
Host:
Target: arm aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-12 00:00:00


Attachments
bad vectorizatoin result for boundary size 16 (667 bytes, text/plain)
2018-07-12 07:35 UTC, Jiangning Liu
Details
bad vectorizatoin result for boundary size 8 (641 bytes, text/plain)
2018-07-12 07:38 UTC, Jiangning Liu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jiangning Liu 2018-07-12 07:35:15 UTC
Created attachment 44386 [details]
bad vectorizatoin result for boundary size 16

For the case below, the code generated by “gcc -O3” is very ugly, and the inner loop can be correctly vectorized. Please refer to attached file test_loop_inner_16.s.

char g_d[1024], g_s1[1024], g_s2[1024];
void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for ( int y = 0; y < 128; y++ )
    {
        for ( int x = 0; x < 16; x++ )
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}

If we change inner loop “for ( int x = 0; x < 16; x++ )” to be like “for ( int x = 0; x < 32; x++ )”, i.e. the loop boundary size changes from 16 to 32, very beautiful vectorization code would be generated. For example, the code below is the aarch64 result for loop boundary size 32, and it the same case for x86.

test_loop:
.LFB0:
        .cfi_startproc
        adrp    x2, g_s1
        adrp    x3, g_s2
        add     x2, x2, :lo12:g_s1
        add     x3, x3, :lo12:g_s2
        adrp    x0, g_d
        adrp    x1, g_d+2048
        add     x0, x0, :lo12:g_d
        add     x1, x1, :lo12:g_d+2048
        ldp     q1, q2, [x2]
        ldp     q3, q0, [x3]
        add     v1.16b, v1.16b, v3.16b
        add     v0.16b, v0.16b, v2.16b
        .p2align 3,,7
.L2:
        str     q1, [x0]
        str     q0, [x0, 16]!
        cmp     x0, x1
        bne     .L2
        ret

The code generated for loop boundary size 8 is also very bad. 

Any idea?
Comment 1 Jiangning Liu 2018-07-12 07:38:39 UTC
Created attachment 44387 [details]
bad vectorizatoin result for boundary size 8
Comment 2 ktkachov 2018-07-12 08:15:26 UTC
Confirmed, I've seen this in other similar loops. Thanks for filing a bug report for this
Comment 3 Richard Biener 2018-07-12 09:00:15 UTC
The issue is the inner complete loop unrolling pass which unrolls loops up to 16 times (a --param controls that number).  You can get good code via -fdisable-tree-cunrolli for example.

So the vectorization issue would be that basic-block vectorization doesn't
catch this in a very nice way - on x86 we pull out the invariant computation
and have a vectorized (outer) loop storing to d.  But we fail to
vectorize the add because we are restricted to a single basic-block and the
stores are still in the inner loop (obviously):

t.c:9:15: note: not vectorized: no grouped stores in basic block.

instead we see

  _238 = MEM[(char *)&g_s2 + 15B];
  _239 = (unsigned char) _238;
  _240 = _236 + _239;
  _242 = (char) _240;
  _234 = {_32, _46, _60, _74, _88, _102, _116, _130, _144, _158, _172, _186, _200, _214, _228, _242};
  vect_cst__237 = _234;

  <bb 3> [local count: 63136020]:
  # vectp_g_d.0_227 = PHI <vectp_g_d.0_15(5), &g_d(2)>
  # ivtmp_31 = PHI <ivtmp_241(5), 0(2)>
  MEM[(char *)vectp_g_d.0_227] = vect_cst__237;
  vectp_g_d.0_15 = vectp_g_d.0_227 + 16;
  ivtmp_241 = ivtmp_31 + 1;
  if (ivtmp_241 < 128)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 62498283]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 637738]:
  return;

so this is a duplicate of the bug that says BB vectorization doesn't consider
a vector CONSTRUCTOR as sink.
Comment 4 rsandifo@gcc.gnu.org 2018-07-20 08:20:19 UTC
About to post a patch.
Comment 5 Tamar Christina 2019-04-09 18:11:49 UTC
Hi Richard,

Do you still plan on working on this? Otherwise I'd like to take it over for GCC10.
Comment 6 rsandifo@gcc.gnu.org 2019-04-10 08:02:58 UTC
(In reply to Tamar Christina from comment #5)
> Hi Richard,
> 
> Do you still plan on working on this? Otherwise I'd like to take it over for
> GCC10.

I've already committed the patch mentioned in #c4 (can't remember
which patch it was offhand) but it turned out not to fix this PR
after all.  So please go ahead.
Comment 7 Tamar Christina 2019-04-10 09:16:35 UTC
Cheers, Thanks Richard, I'll grab this one too then.
Comment 8 Joel Hutton 2019-07-30 16:41:42 UTC
(In reply to Richard Biener from comment #3)

Hi Richard,

> So the vectorization issue would be that basic-block vectorization doesn't
> catch this in a very nice way - on x86 we pull out the invariant computation
> and have a vectorized (outer) loop storing to d.

Just a small clarification, do you mean to say that there is a difference between the way x86 and aarch64 handle this, as far as I can see they handle this in the same way.
Comment 9 Richard Biener 2019-07-31 07:14:54 UTC
(In reply to Joel Hutton from comment #8)
> (In reply to Richard Biener from comment #3)
> 
> Hi Richard,
> 
> > So the vectorization issue would be that basic-block vectorization doesn't
> > catch this in a very nice way - on x86 we pull out the invariant computation
> > and have a vectorized (outer) loop storing to d.
> 
> Just a small clarification, do you mean to say that there is a difference
> between the way x86 and aarch64 handle this, as far as I can see they handle
> this in the same way.

No, I was just mentioning I inspected this on x86 because I do expect the
handling to be the same.
Comment 10 Joel Hutton 2019-11-21 15:58:38 UTC
Should be fixed on trunk
Comment 11 Joel Hutton 2019-11-21 15:59:05 UTC
Should be fixed on trunk by r277784
Comment 12 Tamar Christina 2019-11-21 16:02:08 UTC
Fixed in GCC 10.