[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

Fri Jan 4 16:41:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #11 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
> 
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

Given it is compressing data both pointers will typically be misaligned. Pos
would be fairly random and len does not usually start from zero either. And
it's highly unlikely it will iterate more than a few bytes. Compression matches
are typically just a few bytes long. So unrolling is the only useful
optimization for this kind of code.