Created attachment 47231 [details] testcase cut from 525.x264_r for simple testcase cat test.c: typedef unsigned char uint8_t; static inline uint8_t x264_clip_uint8( int x ) { return x&(~63) ? (-x)>>7 : x; } void mc_weight( uint8_t *dst, uint8_t *src, int i_width, int i_height ) { for( int x = 0; x < i_width; x++ ) dst[x] = x264_clip_uint8(src[x]); } Refer to https://godbolt.org/z/TnACA- Icc generate much better code by using vptestmd and maskmov gcc loop: ------------ vmovdqu8 (%rsi,%rax), %ymm6 vpmovzxbw %xmm6, %ymm3 vpmovzxwd %xmm3, %ymm0 vextracti128 $0x1, %ymm3, %xmm3 vpand %ymm5, %ymm0, %ymm9 vpmovzxwd %xmm3, %ymm3 vextracti128 $0x1, %ymm6, %xmm2 vpcmpd $0, %ymm4, %ymm9, %k0 vpmovzxbw %xmm2, %ymm2 vpand %ymm5, %ymm3, %ymm9 vpcmpd $0, %ymm4, %ymm9, %k1 vpmovzxwd %xmm2, %ymm1 vextracti128 $0x1, %ymm2, %xmm2 vpand %ymm5, %ymm1, %ymm9 vpmovzxwd %xmm2, %ymm2 vpcmpd $0, %ymm4, %ymm9, %k2 vpand %ymm5, %ymm2, %ymm9 kunpckbw %k0, %k1, %k0 vpcmpd $0, %ymm4, %ymm9, %k1 vpsubd %ymm0, %ymm4, %ymm0 vpsubd %ymm3, %ymm4, %ymm3 vpsubd %ymm1, %ymm4, %ymm1 vpsubd %ymm2, %ymm4, %ymm2 kunpckbw %k2, %k1, %k1 kunpckwd %k0, %k1, %k1 vpermt2w %ymm3, %ymm8, %ymm0 vpermt2w %ymm2, %ymm8, %ymm1 vpsraw $7, %ymm0, %ymm0 vpsraw $7, %ymm1, %ymm1 vpand %ymm0, %ymm7, %ymm0 vpand %ymm1, %ymm7, %ymm1 vpackuswb %ymm1, %ymm0, %ymm0 vpermq $216, %ymm0, %ymm0 vmovdqu8 %ymm6, %ymm0{%k1} vmovdqu8 %ymm0, (%rdi,%rax) addq $32, %rax cmpq %rcx, %rax ------------ icc loop: ---------- vpmovzxbd (%rsi,%r8), %ymm3 #12.31 vptestmd %ymm1, %ymm3, %k1 #5.12 vpsubd %ymm3, %ymm0, %ymm2 #12.31 vpsrad $7, %ymm2, %ymm3{%k1} #12.31 vpmovdb %ymm3, (%r8,%rdi) #12.6 addq $8, %r8 #11.2 cmpq %rcx, %r8 #11.2 jb ..B1.7 # Prob 82% #11.2 ---------- origin case cut from SPEC2017 525.x264_r, refer to attachment
Much more simple case, exclude disturb of point alias and unknown loop count cat test.c: typedef unsigned char uint8_t; static inline uint8_t x264_clip_uint8( int x ) { return x&(~63) ? (-x)>>7 : x; } void mc_weight( uint8_t *__restrict dst, uint8_t *__restrict src) { for( int x = 0; x < 16; x++ ) dst[x] = x264_clip_uint8(src[x]); } Refer to https://godbolt.org/z/YJXWRD Gcc failed to vectorize loop, icc succeed.
t.c:12:14: missed: not vectorized: relevant stmt not supported: _4 = (int) _3; # x_25 = PHI <x_13(7), 0(15)> # ivtmp_29 = PHI <ivtmp_28(7), 16(15)> _1 = (sizetype) x_25; _2 = src_9(D) + _1; _3 = *_2; _4 = (int) _3; _5 = dst_10(D) + _1; _11 = _4 & -64; _14 = -_4; _15 = _14 >> 7; iftmp.0_16 = (unsigned char) _15; iftmp.0_17 = _11 == 0 ? _3 : iftmp.0_16; *_5 = iftmp.0_17; x_13 = x_25 + 1; ivtmp_28 = ivtmp_29 - 1; if (ivtmp_28 != 0) goto <bb 7>; [93.75%] else goto <bb 6>; [6.25%] was the if-converted loop. This requires unpacking of char to int so this all boils down to failure to narrow all operations to 'short'. Might be doable with a match.pd pattern simplifying ((int) X) & -64 == 0 to X & -64 == 0 which we maybe already have but there's other uses of _4. ICC also uses effectively two vector sizes, v8qi and v8hi AFAICS? But why does it use %ymm then...
(In reply to Richard Biener from comment #2) > ICC also uses effectively two vector sizes, v8qi and v8hi AFAICS? But > why does it use %ymm then... I think it's v8qi and v8si, icc use vpmovzxbd not vpmovzxbw.
If we change x264_clip_uint8 slightly to: static inline uint8_t x264_clip_uint8( uint8_t x ) { uint8_t t = -x; t = t>>7; uint8_t t1 = x&(~63); return t1 ? t : x; } GCC does the "right" thing. This is all type demontion going on.
This is not just about type promotion/demotion. The results of (-x)>>7 done in unsigned char and in int are different. In the unsigned char case we do rshift by 7 of 8-bits value which leaves us with 7 zeroes in the higher bits and the 0 or 1 in the lowest bit. Since promoting unsigned char to int always produces positive int value then the -x will always be negative. So, in the int case we end up with the result where all 7 higher bits are set to 1 (being carried from the higher bits that are set in int) and 0 or 1 in the lowest bit. For instance for initial 0xff unsigned value: unsigned char: minus(0xff)->0x01; rshift7(0x01)->0x00 int: icast(0xff)->0x000000ff; minus(0x000000ff)->0xffffff01; rshift7(0xffffff01)->0x01fffffe; bcast(0x01fffffe)->0xfe
> (In reply to Anton Youdkevitch from comment #5) > This is not just about type promotion/demotion. > The results of (-x)>>7 done in unsigned char and in int are different. > You're right, in that the version given here can't be optimized when the input value is an unsigned char. The actual one in SPEC2017 can, because it's actually just doing clipping. However that one GCC already generates the most efficient form for these simple test cases. For the actual testcase there's a lot of inefficiency depending on the ISA, but that's an ISA specific issue not a mid-end one. The reported vectorization issue does indeed seem to be AVX512 specific.
I notice TARGET_VECTORIZE_RELATED_MODE is added, and can be used to handle convertion, i'm working on this.