[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Fri May 11 11:19:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can see what the patch does to this testcase on x86_64 - it enables BB
vectorization of the first two loops after unrolling. I don't see anything
suspicious here on x86_64 and 525.x264_r works fine for me.
Can you clarify whether the test, ref or train inputs fail for you? I tried
AVX256, AVX128 and plain old SSE so far without any issue, but ref takes some
time...
Can you check whether the following reduced file produces the same assembly
for add4x4_idct as in the complete benchmark? If so it should be possible to
generate a runtime testcase from it. Please attach preprocessed source if
that doesn't work out.
So far I do suspect we are hitting a latent target issue?
#include <stdint.h>

static uint8_t x264_clip_uint8( int x )
{
    return x&(~255) ? (-x)>>31 : x;
}

void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
{
    int16_t d[16];
    int16_t tmp[16];

    for( int i = 0; i < 4; i++ )
    {
        int s02 = dct[0*4+i] + dct[2*4+i];
        int d02 = dct[0*4+i] - dct[2*4+i];
        int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
        int d13 = (dct[1*4+i]>>1) - dct[3*4+i];

        tmp[i*4+0] = s02 + s13;
        tmp[i*4+1] = d02 + d13;
        tmp[i*4+2] = d02 - d13;
        tmp[i*4+3] = s02 - s13;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s02 = tmp[0*4+i] + tmp[2*4+i];
        int d02 = tmp[0*4+i] - tmp[2*4+i];
        int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
        int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];

        d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
        d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
        d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
        d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }

    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
        p_dst += 32;
    }
}
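
For reference, a runtime testcase built from the reduced file could look roughly like the sketch below. It is only an assumption of how such a test might be structured, not something taken from the benchmark: add4x4_idct_ref, butterfly_ref, lcg and idct_mismatches are hypothetical helpers, the reference routes its intermediate results through a volatile buffer on the assumption that this keeps it from being vectorized, and the input ranges (coefficients in [-512, 511]) are chosen arbitrarily so nothing overflows int16_t.

```c
#include <stdint.h>
#include <string.h>

static uint8_t x264_clip_uint8( int x )
{
    return x&(~255) ? (-x)>>31 : x;
}

/* Function under test: the reduced add4x4_idct from this comment. */
void add4x4_idct( uint8_t *p_dst, int16_t dct[16] )
{
    int16_t d[16];
    int16_t tmp[16];
    for( int i = 0; i < 4; i++ )
    {
        int s02 = dct[0*4+i] + dct[2*4+i];
        int d02 = dct[0*4+i] - dct[2*4+i];
        int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
        int d13 = (dct[1*4+i]>>1) - dct[3*4+i];
        tmp[i*4+0] = s02 + s13;
        tmp[i*4+1] = d02 + d13;
        tmp[i*4+2] = d02 - d13;
        tmp[i*4+3] = s02 - s13;
    }
    for( int i = 0; i < 4; i++ )
    {
        int s02 = tmp[0*4+i] + tmp[2*4+i];
        int d02 = tmp[0*4+i] - tmp[2*4+i];
        int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
        int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];
        d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
        d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
        d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
        d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
        p_dst += 32;
    }
}

/* Hypothetical scalar 1-D butterfly; results go through a volatile buffer. */
static void butterfly_ref( int a, int b, int c, int d, volatile int *out )
{
    int s02 = a + c, d02 = a - c;
    int s13 = b + (d>>1), d13 = (b>>1) - d;
    out[0] = s02 + s13; out[1] = d02 + d13;
    out[2] = d02 - d13; out[3] = s02 - s13;
}

/* Hypothetical reference: same arithmetic as above, written to stay scalar. */
static void add4x4_idct_ref( uint8_t *p_dst, const int16_t dct[16] )
{
    volatile int o[4];
    int16_t d[16], tmp[16];
    for( int i = 0; i < 4; i++ )
    {
        butterfly_ref( dct[i], dct[4+i], dct[8+i], dct[12+i], o );
        for( int k = 0; k < 4; k++ )
            tmp[i*4+k] = (int16_t)o[k];
    }
    for( int i = 0; i < 4; i++ )
    {
        butterfly_ref( tmp[i], tmp[4+i], tmp[8+i], tmp[12+i], o );
        for( int k = 0; k < 4; k++ )
            d[k*4+i] = (int16_t)((o[k] + 32) >> 6);
    }
    for( int y = 0; y < 4; y++ )
        for( int x = 0; x < 4; x++ )
            p_dst[y*32+x] = x264_clip_uint8( p_dst[y*32+x] + d[y*4+x] );
}

/* Small linear congruential generator for deterministic inputs. */
static uint32_t lcg( uint32_t *s ) { return *s = *s * 1664525u + 1013904223u; }

/* Returns how many of 1000 pseudo-random blocks differ between the two. */
int idct_mismatches( void )
{
    uint32_t seed = 1;
    int bad = 0;
    for( int iter = 0; iter < 1000; iter++ )
    {
        uint8_t got[4*32], want[4*32];
        int16_t dct[16];
        for( int k = 0; k < 16; k++ )
            dct[k] = (int16_t)((int)((lcg( &seed ) >> 16) & 1023) - 512);
        for( int k = 0; k < 4*32; k++ )
            got[k] = want[k] = (uint8_t)(lcg( &seed ) >> 24);
        add4x4_idct( got, dct );
        add4x4_idct_ref( want, dct );
        bad += memcmp( got, want, sizeof got ) != 0;
    }
    return bad;
}
```

A driver would call idct_mismatches() from main() and abort() on a non-zero result, built with the same options that show the failure. As a sanity check that can be derived by hand, a DC-only block with dct[0] = 64 gives d[k] = 1 for every k, so each of the 4x4 pixels should increase by one (saturating at 255).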