[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
linkw at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Sun Sep 27 10:42:31 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #27 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #22)
> > One of my workmates found that if we disable vectorization for SPEC2017
> > 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> > explicit function attribute __attribute__((optimize("no-tree-vectorize"))), it
> > can speed up by 4%.
>
> For CLX, if we disable slp vectorization in sub4x4_dct by
> __attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by
> 4%.
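For reference, a minimal sketch of how such a per-function attribute is applied. The function pixel_sub_4x4 and the macro NO_SLP_VECTORIZE are hypothetical names for illustration, not the actual x264 source; only the attribute string comes from the comment above.

```c
#include <stdint.h>

/* GCC-specific: turn off SLP vectorization for one function only.
   Other compilers get an empty macro (assumption: they lack this attribute). */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP_VECTORIZE __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP_VECTORIZE
#endif

/* Hypothetical stand-in for the pixel-difference step of a 4x4 DCT:
   widen two 8-bit pixels and subtract them, for a 4x4 block. */
NO_SLP_VECTORIZE
void pixel_sub_4x4(int16_t diff[16], const uint8_t *pix1, int stride1,
                   const uint8_t *pix2, int stride2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[y * 4 + x] = (int16_t)(pix1[x] - pix2[x]);
        pix1 += stride1;   /* advance to the next row of each block */
        pix2 += stride2;
    }
}
```

The attribute only changes how this one function is optimized; the rest of the translation unit keeps the command-line flags.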
>
> > Thanks Richi! Should we take care of this case, or neglect this kind of
> > extension as "no instruction"? I intended to handle it in target-specific
> > code, but it isn't recorded into the cost vector, and it seems too heavy to
> > revisit the bb_info slp_instances in finish_cost.
>
> For the i386 backend, unsigned char --> unsigned short is not "no instruction",
Thanks for the information; it means this is target specific.
> but in this case
> ---
> _134 = MEM[(pixel *)pix1_295 + 2B];
> _135 = (short unsigned int) _134;
> ---
>
> It could be combined and optimized to
> ---
> movzbl 19(%rcx), %r8d
> ---
>
> So, if the "unsigned char" variable is loaded from memory, then the conversion
> would also be "no instruction". I'm not sure if the backend cost model could
> handle such a situation.
You could probably try tweaking it in ix86_add_stmt_cost: when the statement is
a UB to UH conversion, further check whether the def of the input UB is a MEM.
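
A rough sketch of the check I have in mind, written as a self-contained toy model rather than real GCC code. All toy_* types and functions are hypothetical and do not correspond to GCC internals or to the actual ix86_add_stmt_cost signature; the zero cost for the folded case is an assumption based on the movzbl example above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy statement model; illustrative only, not GCC's data structures. */
enum toy_kind { TOY_LOAD, TOY_CONVERT, TOY_OTHER };
enum toy_type { TOY_UB, TOY_UH, TOY_SI };  /* unsigned byte, unsigned halfword, other */

struct toy_stmt {
    enum toy_kind kind;
    enum toy_type result_type;
    const struct toy_stmt *input_def;  /* statement defining the operand, or NULL */
};

/* True if stmt widens a UB value to UH and that value comes straight
   from a memory load, so the widening can fold into the load. */
static bool toy_widen_folds_into_load(const struct toy_stmt *stmt)
{
    if (stmt->kind != TOY_CONVERT || stmt->result_type != TOY_UH)
        return false;
    const struct toy_stmt *def = stmt->input_def;
    return def != NULL && def->kind == TOY_LOAD && def->result_type == TOY_UB;
}

/* Scalar cost of stmt: zero when the conversion folds into the load,
   otherwise the caller-supplied base cost. */
static int toy_scalar_stmt_cost(const struct toy_stmt *stmt, int base_cost)
{
    return toy_widen_folds_into_load(stmt) ? 0 : base_cost;
}
```

In the real hook the equivalent test would have to look through the statement's operand to its defining statement; whether that is cheap enough to do there is still an open question.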