Created attachment 43148
benchmark

Shifts of vector builtins with an 8-bit integral element type can be optimized better. For example, `v << n` can be implemented as:
1. load 0x00ff00ff00ff... and shift it left by n as 16-bit elements
2. xor (1) with 0xff00ff00ff00... to produce a bitmask
3. shift v left by n as 16-bit elements
4. bitwise and of (2) and (3)

I'll attach a benchmark with an intrinsics-based implementation.
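The four steps above can be sketched in portable scalar C++ by modelling a single 16-bit lane that holds two 8-bit elements. This is only an illustration and is not taken from the attached benchmark: the helper name `shl_two_bytes` is hypothetical, and a real vector implementation would apply the same operations to every 16-bit lane of a register at once.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the attached benchmark): shifts both bytes
// packed in one 16-bit lane left by n, keeping each byte's result in place.
inline std::uint16_t shl_two_bytes(std::uint16_t lane, unsigned n) {
    assert(n < 8);  // this sketch only covers shift counts valid for 8-bit elements

    // Step 1: shift the constant 0x00ff as a 16-bit value.
    const std::uint16_t shifted_const =
        static_cast<std::uint16_t>(0x00ffu << n);

    // Step 2: xor with 0xff00, so each byte of the mask is (0xff << n) & 0xff.
    const std::uint16_t mask =
        static_cast<std::uint16_t>(shifted_const ^ 0xff00u);

    // Step 3: shift the data as a 16-bit value; bits of the low byte spill
    // into the bottom of the high byte.
    const std::uint16_t shifted =
        static_cast<std::uint16_t>(static_cast<std::uint32_t>(lane) << n);

    // Step 4: the mask clears the spilled bits and keeps everything else.
    return static_cast<std::uint16_t>(shifted & mask);
}
```

For n = 3, the mask is 0x07f8 ^ 0xff00 = 0xf8f8, so `shl_two_bytes(0x81ff, 3)` gives 0x0ff8 & 0xf8f8 = 0x08f8, which matches shifting the bytes 0x81 and 0xff separately.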
Created attachment 43149
tsc.h

Header required for the benchmark code.
I compiled with: g++-7 -march=haswell -std=c++17 -O3 -flax-vector-conversions -o char_shift char_shift.cpp