 1 static inline void pixel_avg( uint8_t *dst,  int i_dst_stride,
 2                               uint8_t *src1, int i_src1_stride,
 3                               uint8_t *src2, int i_src2_stride,
 4                               int i_width, int i_height )
 5 {
 6     for( int y = 0; y < i_height; y++ )
 7     {
 8         for( int x = 0; x < i_width; x++ )
 9             dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
10         dst += i_dst_stride;
11         src1 += i_src1_stride;
12         src2 += i_src2_stride;
13     }
14 }
Suppose the above code is in a hot loop.
If i_width is between 16 and 32, -mprefer-vector-width=128 can provide a ~6% performance improvement compared to -mprefer-vector-width=256.
i_width must be at least 16 to trigger 128-bit vectorization at line 8.
i_width must be at least 32 to trigger 256-bit vectorization at line 8.
/* block is assumed to be declared elsewhere; dimensions here are illustrative. */
extern int block[9][9][9];

void foo(int row, int k, int h)
{
    /* Variable nrow ranges from 4 to 9. */
    int nrow = ((row - 1)/3 + 1)*3 + 1;
    for (int i = nrow; i < 9; i++)
        block[k][h][i] = block[k][h][i] - 10;
}
Since nrow ranges from 4 to 9, the 256-bit vector body will never be
executed (the loop always has fewer than 8 elements), so 256-bit
vectorization effectively amounts to no vectorization plus additional branch
cost. Even with epilogue vectorization, 256-bit vectors still carry more
overhead. When this is a hot function, 256-bit vectors can reduce
performance by 6%.
When the loop trip count is known, the vectorizer won't select a 256-bit
vector when a 256-bit vector can't be used. When the loop trip count is
unknown, a 256-bit vector can be slower than a 128-bit vector, depending on
the workload. In the case of SPEC CPU 2017, 128-bit vectors are much faster
than 256-bit vectors for a couple of benchmarks. For most benchmarks, there
is no performance difference between 128-bit and 256-bit vectors.
I believe we have a duplicate PR for exactly this case and Andre is working on it.
Yes this looks like a duplicate of PR 88915. I'll mark it as such.
*** This bug has been marked as a duplicate of bug 88915 ***