[Bug tree-optimization/101097] Vectorizer is too eager to use vec_unpack

Thu Jun 17 06:38:10 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101097

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-* i?86-*-*
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-06-17

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, so the difference is that loop vectorization is used for 'foo', while
for 'bar' loop vectorization fails and BB vectorization succeeds.  Disabling
loop vectorization but keeping BB vectorization enabled also produces optimal
code for 'foo' (complete unrolling happens before vectorization):

foo:
.LFB0:
        .cfi_startproc
        vpmovzxwd       (%rsi), %ymm0
        vpmovzxwd       (%rdi), %ymm1
        vpaddd  %ymm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdx)
        vzeroupper

The key difference in the vectorizer is that BB vectorization supports
different vector sizes in the same instance, whereas the loop vectorizer can
only use a single vector size.
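
As a rough illustration of what "different vector sizes in the same instance"
means for the assembly above, here is a sketch using GCC's generic vector
extensions.  The function name foo8_mixed and the typedefs are made up for
this example, and the eight-element count is only inferred from the single
ymm store of ints; this is not what the vectorizer does internally, just the
same shape written by hand: 16-byte inputs widened into a 32-byte result.

```c
#include <string.h>

/* Hypothetical typedefs: 8 x unsigned short (16 bytes), 8 x int (32 bytes).  */
typedef unsigned short v8hu __attribute__ ((vector_size (16)));
typedef int v8si __attribute__ ((vector_size (32)));

/* Hand-written analogue of the vpmovzxwd/vpaddd sequence: two 16-byte loads,
   zero-extension to a 32-byte vector, one add, one 32-byte store.  */
void
foo8_mixed (const unsigned short *p1, const unsigned short *p2, int *p3)
{
  v8hu a, b;
  memcpy (&a, p1, sizeof a);   /* unaligned 16-byte load */
  memcpy (&b, p2, sizeof b);

  /* Element-wise conversion; unsigned short -> int zero-extends.  */
  v8si wa = __builtin_convertvector (a, v8si);
  v8si wb = __builtin_convertvector (b, v8si);

  v8si sum = wa + wb;
  memcpy (p3, &sum, sizeof sum);   /* unaligned 32-byte store */
}
```

With a single vector size, the 16-byte inputs and the 32-byte result cannot
coexist like this, so the inputs have to be loaded at full width and then
unpacked in halves.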

There are some related PRs in that context.

void
foo (unsigned short* p1, unsigned short* p2, int* __restrict p3)
{
    for (int i = 0; i != 32; i++)
      p3[i] = p1[i] + p2[i];
    return;
}

never gets optimal code because it has too many iterations for complete
unrolling to trigger (--param max-completely-peel-times defaults to 16).
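
For anyone wanting to experiment with this variant, a small self-contained
sketch follows.  The helper check_foo and the option combinations in the
comment are illustrative and not taken from this report; in particular,
whether raising max-completely-peel-times alone is enough to get the loop
unrolled and BB vectorized has not been verified here, since other unrolling
limits may also apply.

```c
/* Options one might compare (assumed, untested):
     gcc -O3 -mavx2 -S t.c
     gcc -O3 -mavx2 -fno-tree-loop-vectorize -ftree-slp-vectorize -S t.c
     gcc -O3 -mavx2 -fno-tree-loop-vectorize \
         --param max-completely-peel-times=32 -S t.c  */

enum { N = 32 };

/* The 32-iteration loop from the comment.  */
void
foo (unsigned short *p1, unsigned short *p2, int *__restrict p3)
{
  for (int i = 0; i != N; i++)
    p3[i] = p1[i] + p2[i];
}

/* Hypothetical helper: returns 1 if foo widens before adding, i.e. the
   sums do not wrap at 16 bits, and 0 otherwise.  */
int
check_foo (void)
{
  unsigned short a[N], b[N];
  int out[N];

  for (int i = 0; i < N; i++)
    {
      a[i] = 65535;
      b[i] = (unsigned short) (i + 1);
    }

  foo (a, b, out);

  for (int i = 0; i < N; i++)
    if (out[i] != 65536 + i)
      return 0;
  return 1;
}
```

The check only confirms that the result is the same whichever vectorizer
path is taken; the interesting difference is in the generated assembly.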

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations