Bug 92492 - AVX512: Missed vectorization opportunity
Summary: AVX512: Missed vectorization opportunity
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 92658
Blocks: spec vectorizer
  Show dependency treegraph
 
Reported: 2019-11-13 08:35 UTC by Hongtao.liu
Modified: 2024-06-08 14:31 UTC (History)
2 users (show)

See Also:
Host:
Target: i386, x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-11-13 00:00:00


Attachments
testcase cut from 525.x264_r (851 bytes, text/plain)
2019-11-13 08:35 UTC, Hongtao.liu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Hongtao.liu 2019-11-13 08:35:55 UTC
Created attachment 47231 [details]
testcase cut from 525.x264_r

for simple testcase
cat test.c:

typedef unsigned char uint8_t;

static inline uint8_t x264_clip_uint8( int x )
{
  return x&(~63) ? (-x)>>7 : x;
}


void mc_weight( uint8_t *dst, uint8_t *src, int i_width, int i_height )
{
	for( int x = 0; x < i_width; x++ )
	    dst[x] = x264_clip_uint8(src[x]);
}

Refer to

https://godbolt.org/z/TnACA-

Icc generate much better code by using vptestmd and maskmov

gcc loop:
------------
        vmovdqu8        (%rsi,%rax), %ymm6
        vpmovzxbw       %xmm6, %ymm3
        vpmovzxwd       %xmm3, %ymm0
        vextracti128    $0x1, %ymm3, %xmm3
        vpand   %ymm5, %ymm0, %ymm9
        vpmovzxwd       %xmm3, %ymm3
        vextracti128    $0x1, %ymm6, %xmm2
        vpcmpd  $0, %ymm4, %ymm9, %k0
        vpmovzxbw       %xmm2, %ymm2
        vpand   %ymm5, %ymm3, %ymm9
        vpcmpd  $0, %ymm4, %ymm9, %k1
        vpmovzxwd       %xmm2, %ymm1
        vextracti128    $0x1, %ymm2, %xmm2
        vpand   %ymm5, %ymm1, %ymm9
        vpmovzxwd       %xmm2, %ymm2
        vpcmpd  $0, %ymm4, %ymm9, %k2
        vpand   %ymm5, %ymm2, %ymm9
        kunpckbw        %k0, %k1, %k0
        vpcmpd  $0, %ymm4, %ymm9, %k1
        vpsubd  %ymm0, %ymm4, %ymm0
        vpsubd  %ymm3, %ymm4, %ymm3
        vpsubd  %ymm1, %ymm4, %ymm1
        vpsubd  %ymm2, %ymm4, %ymm2
        kunpckbw        %k2, %k1, %k1
        kunpckwd        %k0, %k1, %k1
        vpermt2w        %ymm3, %ymm8, %ymm0
        vpermt2w        %ymm2, %ymm8, %ymm1
        vpsraw  $7, %ymm0, %ymm0
        vpsraw  $7, %ymm1, %ymm1
        vpand   %ymm0, %ymm7, %ymm0
        vpand   %ymm1, %ymm7, %ymm1
        vpackuswb       %ymm1, %ymm0, %ymm0
        vpermq  $216, %ymm0, %ymm0
        vmovdqu8        %ymm6, %ymm0{%k1}
        vmovdqu8        %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rcx, %rax
------------

icc loop:
----------
        vpmovzxbd (%rsi,%r8), %ymm3                             #12.31
        vptestmd  %ymm1, %ymm3, %k1                             #5.12
        vpsubd    %ymm3, %ymm0, %ymm2                           #12.31
        vpsrad    $7, %ymm2, %ymm3{%k1}                         #12.31
        vpmovdb   %ymm3, (%r8,%rdi)                             #12.6
        addq      $8, %r8                                       #11.2
        cmpq      %rcx, %r8                                     #11.2
        jb        ..B1.7        # Prob 82%                      #11.2
----------


origin case cut from SPEC2017 525.x264_r, refer to attachment
Comment 1 Hongtao.liu 2019-11-13 08:53:48 UTC
Much more simple case, exclude disturb of point alias and unknown loop count
cat test.c:

typedef unsigned char uint8_t;

static inline uint8_t x264_clip_uint8( int x )
{
  return x&(~63) ? (-x)>>7 : x;
}


void mc_weight( uint8_t *__restrict dst, uint8_t *__restrict src)
{
	for( int x = 0; x < 16; x++ )
	    dst[x] = x264_clip_uint8(src[x]);
}

Refer to https://godbolt.org/z/YJXWRD 

Gcc failed to vectorize loop, icc succeed.
Comment 2 Richard Biener 2019-11-13 12:05:46 UTC
t.c:12:14: missed:   not vectorized: relevant stmt not supported: _4 = (int) _3;

# x_25 = PHI <x_13(7), 0(15)>
# ivtmp_29 = PHI <ivtmp_28(7), 16(15)>
_1 = (sizetype) x_25;
_2 = src_9(D) + _1;
_3 = *_2;
_4 = (int) _3;
_5 = dst_10(D) + _1;
_11 = _4 & -64;
_14 = -_4;
_15 = _14 >> 7;
iftmp.0_16 = (unsigned char) _15;
iftmp.0_17 = _11 == 0 ? _3 : iftmp.0_16;
*_5 = iftmp.0_17;
x_13 = x_25 + 1;
ivtmp_28 = ivtmp_29 - 1;
if (ivtmp_28 != 0)
  goto <bb 7>; [93.75%]
else
  goto <bb 6>; [6.25%]

was the if-converted loop.  This requires unpacking of char to int so this
all boils down to failure to narrow all operations to 'short'.  Might
be doable with a match.pd pattern simplifying ((int) X) & -64 == 0
to X & -64 == 0 which we maybe already have but there's other uses of _4.

ICC also uses effectively two vector sizes, v8qi and v8hi AFAICS?  But
why does it use %ymm then...
Comment 3 Hongtao.liu 2019-11-14 03:22:47 UTC
(In reply to Richard Biener from comment #2)
> ICC also uses effectively two vector sizes, v8qi and v8hi AFAICS?  But
> why does it use %ymm then...

I think it's v8qi and v8si, icc use vpmovzxbd not vpmovzxbw.
Comment 4 Andrew Pinski 2020-01-07 10:20:45 UTC
If we change x264_clip_uint8 slightly to:

static inline uint8_t x264_clip_uint8( uint8_t x )
{
  uint8_t t = -x;
  t = t>>7;
  uint8_t t1 = x&(~63);
  return t1 ? t : x;
}


GCC does the "right" thing.

This is all type demontion going on.
Comment 5 Anton Youdkevitch 2020-02-18 18:36:28 UTC
This is not just about type promotion/demotion.
The results of (-x)>>7 done in unsigned char and in int are different.

In the unsigned char case we do rshift by 7 of 8-bits value which leaves us with 7 zeroes in the higher bits and the 0 or 1 in the lowest bit.

Since promoting unsigned char to int always produces positive int value then the -x will always be negative. So, in the int case we end up with the result where all 7 higher bits are set to 1 (being carried from the higher bits that are set in int) and 0 or 1 in the lowest bit.

For instance for initial 0xff unsigned value:

unsigned char:
minus(0xff)->0x01; rshift7(0x01)->0x00

int:
icast(0xff)->0x000000ff; minus(0x000000ff)->0xffffff01; rshift7(0xffffff01)->0x01fffffe; bcast(0x01fffffe)->0xfe
Comment 6 Tamar Christina 2020-02-21 13:57:45 UTC
> (In reply to Anton Youdkevitch from comment #5)
> This is not just about type promotion/demotion.
> The results of (-x)>>7 done in unsigned char and in int are different.
> 

You're right, in that the version given here can't be optimized when the input value is an unsigned char.

The actual one in SPEC2017 can, because it's actually just doing clipping. However that one GCC already generates the most efficient form for these simple test cases.

For the actual testcase there's a lot of inefficiency depending on the ISA, but that's an ISA specific issue not a mid-end one.

The reported vectorization issue does indeed seem to be AVX512 specific.
Comment 7 Hongtao.liu 2020-11-13 02:59:22 UTC
I notice TARGET_VECTORIZE_RELATED_MODE is added, and can be used to handle convertion, i'm working on this.