[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
pthaugen at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon May 14 19:58:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698
--- Comment #6 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> I can see what the patch does to this testcase on x86_64 - it enables BB
> vectorization of the first two loops after unrolling. I don't see anything
> suspicious here on x86_64 and 525.x264_r works fine for me.
>
> Can you clarify whether test, ref or train inputs fail for you? I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
>
> Can you check whether the following reduced file produces the same assembly
> for add4x4_idct as in the complete benchmark? If so it should be possible to
> generate a runtime testcase from it. Please attach preprocessed source if
> that doesn't work out.
>
> So far I suspect we are hitting a latent target issue.
>
> #include <stdint.h>
> static uint8_t x264_clip_uint8( int x )
> {
>   return x&(~255) ? (-x)>>31 : x;
> }
> void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
> {
>   int16_t d[16];
>   int16_t tmp[16];
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = dct[0*4+i] + dct[2*4+i];
>       int d02 = dct[0*4+i] - dct[2*4+i];
>       int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
>       int d13 = (dct[1*4+i]>>1) - dct[3*4+i];
>       tmp[i*4+0] = s02 + s13;
>       tmp[i*4+1] = d02 + d13;
>       tmp[i*4+2] = d02 - d13;
>       tmp[i*4+3] = s02 - s13;
>     }
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = tmp[0*4+i] + tmp[2*4+i];
>       int d02 = tmp[0*4+i] - tmp[2*4+i];
>       int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
>       int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];
>       d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
>       d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
>       d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
>       d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
>     }
>   for( int y = 0; y < 4; y++ )
>     {
>       for( int x = 0; x < 4; x++ )
>         p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
>       p_dst += 32;
>     }
> }
Yes, that produces similar code, and adding the following to it produces an
executable test that fails at -O3.
#include <stdlib.h>  /* abort */

int main(void)
{
  uint8_t dst[128];
  int16_t dct[16];
  int i;

  for (i = 0; i < 16; i++)
    dct[i] = i*10 + i;
  for (i = 0; i < 128; i++)
    dst[i] = i;

  add4x4_idct(dst, dct);

  /* Check rows 0 and 1 of the 4x4 block (the row stride is 32 bytes).  */
  if (dst[0] != 14 || dst[1] != 0 || dst[2] != 4 || dst[3] != 2
      || dst[32] != 28 || dst[33] != 35 || dst[34] != 33 || dst[35] != 35)
    abort();
  return 0;
}
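As a side note on the expected values in the check above: working the arithmetic by hand, dst[1] should be 0 because the unclamped sum for that pixel is negative (1 + (-11) = -10) and x264_clip_uint8 clamps it. The sketch below compares that branchless clamp against a plain branchy one. The names clip_uint8_ref and clip_matches_reference are hypothetical helpers introduced only for illustration, and the clamp is written without `static` so it can be called from elsewhere. It assumes a 32-bit int and that >> on a negative int is an arithmetic shift, which is implementation-defined in C but is what GCC does.

```c
#include <stdint.h>

/* Same expression as the clamp in the reduced testcase.  For x < 0, -x is
   positive and (-x)>>31 is 0.  For x > 255, -x is negative and (-x)>>31 is
   -1, which converts to 255.  Assumes 32-bit int and arithmetic >>.  */
uint8_t x264_clip_uint8(int x)
{
    return x & (~255) ? (-x) >> 31 : x;
}

/* Hypothetical branchy reference clamp, not part of the benchmark.  */
uint8_t clip_uint8_ref(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (uint8_t)x;
}

/* Hypothetical helper: returns 1 if both clamps agree on every value in
   [-70000, 70000], 0 otherwise.  */
int clip_matches_reference(void)
{
    for (int x = -70000; x <= 70000; x++)
        if (x264_clip_uint8(x) != clip_uint8_ref(x))
            return 0;
    return 1;
}
```

If the two clamps agree, a wrong dst[] value would more likely come from the vectorized IDCT loops than from the scalar clamp, though that is only a way to narrow things down, not a diagnosis.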
Continuing to debug further...