[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

Tue Jan 8 14:19:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 8 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> 
> --- Comment #15 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #14)
> > On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> > > 
> > > --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> > > So to add some real numbers to the discussion, the average number of iterations
> > > is 4.31. Frequency stats (16 includes all iterations > 16 too):
> > > 
> > > 1: 29.0
> > > 2: 4.2
> > > 3: 1.0
> > > 4: 36.7
> > > 5: 8.7
> > > 6: 3.4
> > > 7: 3.0
> > > 8: 2.6
> > > 9: 2.1
> > > 10: 1.9
> > > 11: 1.6
> > > 12: 1.2
> > > 13: 0.9
> > > 14: 0.8
> > > 15: 0.7
> > > 16: 2.1
> > > 
> > > So unrolling 4x is perfect for this loop. Note the official xz version has
> > > optimized this loop since 2014(!) using unaligned accesses:
> > > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
> > 
> > I guess if we'd have some data to guide then classical unrolling using
> > duffs device would be best here?  Because peeling will increase the
> > number of dynamic branches and likely the actual distribution of
> > #iterations isn't so that they will be well predicted?
> 
> Duff's device is very slow on any CPU with branch prediction. It can't be used
> here given you don't know the number of iterations in advance (only the
> maximum).
> 
> Unrolling reduces the number of branches given the loop branch is only taken
> once every N iterations. The loop overhead is significant here: 3 out of 7
> instructions, with 4x unrolling that reduces to 3 in 19.

But you can't elide the checks in the peeled copies and for 4-times
unrolling you have most cases exiting on the first or fourth check.

Duffs device simply merges the prologue iterations for unrolling
with the loop body so I don't see why it can't be used.  It's

  switch (n % 4)
   {
    loop:
       iter
       n--;
    case 3:
       iter
       n--;
    case 2:
       iter
       n--
    case 1:
       iter
       n--;
    case 0:
       if (n != 0)
         goto loop;
   } 

it's cost is mainly the computed jump into the loop body.  Then
you have a four-fold reduction in branches without the overhead
of having another three loop body copies in the prologue with
retained early exit checks.