525.x264_r has

typedef unsigned char uint8_t;

void mc_chroma( uint8_t *dst, int i_dst_stride,
                uint8_t *src, int i_src_stride,
                int mvx, int mvy,
                int i_width, int i_height )
{
    uint8_t *srcp;

    int d8x = mvx&0x07;
    int d8y = mvy&0x07;
    int cA = (8-d8x)*(8-d8y);
    int cB = d8x    *(8-d8y);
    int cC = (8-d8x)*d8y;
    int cD = d8x    *d8y;

    src += (mvy >> 3) * i_src_stride + (mvx >> 3);
    srcp = &src[i_src_stride];

    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( cA*src[x] + cB*src[x+1] + cC*srcp[x] + cD*srcp[x+1] + 32 ) >> 6;
        dst += i_dst_stride;
        src = srcp;
        srcp += i_src_stride;
    }
}

where the inner loop could use two dot_prodvNhiv2Nqi, iff we had an SLP pattern recognizing this and iff we'd narrow the invariants to [us]char (pattern recog demotes the multiply to HImode, range info on c[ABCD] indicates they fit in QImode). And iff we'd nail down which lanes get summed for dot_prod (other related summing optabs have the same unspecifiedness here, making them only useful for reductions where we end up summing the result lanes).
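To make the intended regrouping concrete, here is a scalar sketch of one row of the inner loop written as two 2-element dot products with the coefficients narrowed to unsigned char. The names dot2_u8 and mc_chroma_row_dot2 are hypothetical and only illustrate the shape of the computation; this is not what the vectorizer emits today, and it says nothing about which lanes a real dot_prod instruction sums.

```c
#include <stdint.h>

/* Hypothetical helper modelling one lane of a widening dot product:
   two unsigned 8-bit products summed into a wider accumulator. */
static inline unsigned dot2_u8(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1)
{
    return (unsigned)a0 * b0 + (unsigned)a1 * b1;
}

/* One row of mc_chroma with the four-term sum regrouped as two dot2 ops.
   Narrowing the coefficients to uint8_t is safe because d8x and d8y are
   in [0, 7], so every coefficient lies in [0, 64]. */
static void mc_chroma_row_dot2(uint8_t *dst, const uint8_t *src,
                               const uint8_t *srcp, int d8x, int d8y,
                               int i_width)
{
    const uint8_t cA = (uint8_t)((8 - d8x) * (8 - d8y));
    const uint8_t cB = (uint8_t)(d8x * (8 - d8y));
    const uint8_t cC = (uint8_t)((8 - d8x) * d8y);
    const uint8_t cD = (uint8_t)(d8x * d8y);

    for (int x = 0; x < i_width; x++) {
        /* {cA, cB} . {src[x], src[x+1]} and {cC, cD} . {srcp[x], srcp[x+1]} */
        unsigned top = dot2_u8(cA, cB, src[x], src[x + 1]);
        unsigned bot = dot2_u8(cC, cD, srcp[x], srcp[x + 1]);
        /* The coefficients sum to 64, so the result fits in 8 bits. */
        dst[x] = (uint8_t)((top + bot + 32) >> 6);
    }
}
```

Each output element pairs adjacent source bytes with an invariant coefficient pair, which is the pattern an SLP dot-product recognizer would have to match.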
Other "failures" for this loop nest include not choosing an alias versioning check that is invariant in the outer loop, which would let us version the outer loop instead of just the inner loop. We also fail to implement (or schedule) unswitching of the outer loop on the inner-loop niter checks that the vectorizer inserts for costing and for jumping around the vector loop/epilog. For perfect nests (and outer loops without actual code) the vectorizer could consider unswitching the initial epilog entry conditions itself (at the expense of code size, of course).
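As a hand-written illustration of versioning the whole nest rather than the inner loop, the sketch below performs one overlap check per call and then selects between two copies of the nest. The function names, the restrict-qualified row kernel and the footprint computation are assumptions made for this example (non-negative strides, positive width and height); the vectorizer's actual runtime alias check may be formulated differently.

```c
#include <stdint.h>

/* Row kernel for the no-alias version: restrict promises that dst does
   not overlap anything else accessed here. */
static void row_noalias(uint8_t *restrict dst, const uint8_t *src,
                        const uint8_t *srcp, const int c[4], int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((c[0] * src[x] + c[1] * src[x + 1]
                            + c[2] * srcp[x] + c[3] * srcp[x + 1] + 32) >> 6);
}

/* Fallback row kernel with no aliasing assumption. */
static void row_mayalias(uint8_t *dst, const uint8_t *src,
                         const uint8_t *srcp, const int c[4], int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((c[0] * src[x] + c[1] * src[x + 1]
                            + c[2] * srcp[x] + c[3] * srcp[x + 1] + 32) >> 6);
}

/* Hypothetical outer-loop-versioned variant of mc_chroma. */
void mc_chroma_versioned(uint8_t *dst, int i_dst_stride,
                         uint8_t *src, int i_src_stride,
                         int mvx, int mvy, int i_width, int i_height)
{
    int d8x = mvx & 0x07, d8y = mvy & 0x07;
    const int c[4] = { (8 - d8x) * (8 - d8y), d8x * (8 - d8y),
                       (8 - d8x) * d8y, d8x * d8y };

    if (i_width <= 0 || i_height <= 0)
        return;
    src += (mvy >> 3) * i_src_stride + (mvx >> 3);

    /* Byte ranges touched by the whole nest: dst rows 0..h-1, width w;
       source rows 0..h (srcp reads one row ahead), width w+1. */
    uintptr_t d_lo = (uintptr_t)dst;
    uintptr_t d_hi = d_lo + (uintptr_t)(i_height - 1) * (uintptr_t)i_dst_stride
                     + (uintptr_t)i_width;
    uintptr_t s_lo = (uintptr_t)src;
    uintptr_t s_hi = s_lo + (uintptr_t)i_height * (uintptr_t)i_src_stride
                     + (uintptr_t)i_width + 1;
    int overlap = d_lo < s_hi && s_lo < d_hi;

    const uint8_t *s = src, *sp = src + i_src_stride;
    if (!overlap) {
        /* Versioned nest: the check is not repeated per row. */
        for (int y = 0; y < i_height; y++, dst += i_dst_stride,
                                      s = sp, sp += i_src_stride)
            row_noalias(dst, s, sp, c, i_width);
    } else {
        for (int y = 0; y < i_height; y++, dst += i_dst_stride,
                                      s = sp, sp += i_src_stride)
            row_mayalias(dst, s, sp, c, i_width);
    }
}
```

The whole-nest check is more conservative than a per-row check, since rows that never overlap individually can still fall inside each other's bounding range, but it runs once instead of i_height times.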