[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

Mon Jan 7 13:58:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> 
> --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> So to add some real numbers to the discussion, the average number of iterations
> is 4.31. Frequency stats (16 includes all iterations > 16 too):
> 
> 1: 29.0
> 2: 4.2
> 3: 1.0
> 4: 36.7
> 5: 8.7
> 6: 3.4
> 7: 3.0
> 8: 2.6
> 9: 2.1
> 10: 1.9
> 11: 1.6
> 12: 1.2
> 13: 0.9
> 14: 0.8
> 15: 0.7
> 16: 2.1
> 
> So unrolling 4x is perfect for this loop. Note the official xz version has
> optimized this loop since 2014(!) using unaligned accesses:
> https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
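
For reference, the unaligned-access idea can be sketched roughly as below.
This is not the xz code. The name match_len_words and its parameters are
made up for illustration, and the sketch assumes a little-endian target
and GCC's __builtin_ctzll. It compares 8 bytes per step and falls back to
a byte loop for the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length of the common prefix of a and b, starting
   at len and never reading at or beyond limit.  Assumes little-endian. */
static size_t
match_len_words (const uint8_t *a, const uint8_t *b, size_t len, size_t limit)
{
  while (len + 8 <= limit)
    {
      uint64_t x, y;
      /* memcpy expresses an unaligned load without undefined behaviour.  */
      memcpy (&x, a + len, 8);
      memcpy (&y, b + len, 8);
      uint64_t diff = x ^ y;
      if (diff != 0)
        /* On little-endian the lowest set bit lies in the first
           mismatching byte; divide the bit index by 8.  */
        return len + (size_t) (__builtin_ctzll (diff) >> 3);
      len += 8;
    }
  /* Byte-wise tail for the last (limit - len) < 8 bytes.  */
  while (len < limit && a[len] == b[len])
    ++len;
  return len;
}
```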

I guess if we had some data to guide us, then classical unrolling using
Duff's device would be best here?  Peeling will increase the number of
dynamic branches, and the actual distribution of #iterations is likely
not one where those branches will be well predicted.
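
To make the suggestion concrete, here is a minimal hand-written sketch of
4x unrolling via Duff's device for a byte-comparison loop of the form
"while (len < limit && a[len] == b[len]) ++len;". The name match_len_duff
and its parameters are made up for illustration. This is not what GCC
emits, only the shape of the transformation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of the common prefix of a and b, starting
   at len and never reading at or beyond limit.  */
static size_t
match_len_duff (const uint8_t *a, const uint8_t *b, size_t len, size_t limit)
{
  if (len >= limit)
    return len;

  size_t n = limit - len;        /* maximum number of compares, n > 0 */
  size_t passes = (n + 3) / 4;   /* trips through the unrolled body */

  /* Jump into the middle of the body so that the first pass handles the
     n % 4 leftover compares and every later pass handles exactly 4.  */
  switch (n % 4)
    {
    case 0: do { if (a[len] != b[len]) return len; ++len;
    case 3:      if (a[len] != b[len]) return len; ++len;
    case 2:      if (a[len] != b[len]) return len; ++len;
    case 1:      if (a[len] != b[len]) return len; ++len;
               } while (--passes > 0);
    }
  return len;
}
```

Compared with peeling, the remainder is handled by one multiway dispatch
at entry instead of a separate chain of peeled iterations, and the
loop-back test is paid once per four compares.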