[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

Wed Sep 16 11:17:55 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #7)
> Two questions come to mind; I need to dig into them further:
>   1) From the assembly of the scalar/vector code, I don't see any stores
> into the temp array d (array diff in pixel_sub_wxh), but when modeling we
> still cost those stores.

Because when modeling they are still there.  There's no good way around this.
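
For reference, here is a simplified sketch of the pattern in question. It
assumes the usual shape of pixel_sub_wxh and a caller that reads the temporary
array straight back. The typedefs, the const qualifiers and the helper
sum_first_row are illustrative assumptions, not the exact x264 source. At the
point where the vectorizer costs this code the stores into d are real
statements; they only go away later, once the stored values are forwarded to
the loads.

```c
#include <stdint.h>

typedef uint8_t pixel;    /* assumption: 8-bit pixel build */
typedef int16_t dctcoef;  /* assumption: 16-bit coefficients */

/* Store the i_size x i_size block of differences pix1 - pix2 into diff. */
static inline void pixel_sub_wxh(dctcoef *diff, int i_size,
                                 const pixel *pix1, int i_pix1,
                                 const pixel *pix2, int i_pix2)
{
    for (int y = 0; y < i_size; y++) {
        for (int x = 0; x < i_size; x++)
            diff[x + y * i_size] = (dctcoef)(pix1[x] - pix2[x]);
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

/* Hypothetical caller: d is a local temporary that is written by
   pixel_sub_wxh and read back immediately, as in sub4x4_dct.  */
static int sum_first_row(const pixel *pix1, int i_pix1,
                         const pixel *pix2, int i_pix2)
{
    dctcoef d[16];
    pixel_sub_wxh(d, 4, pix1, i_pix1, pix2, i_pix2);
    return d[0] + d[1] + d[2] + d[3];
}
```
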

> On Power, two vector stores take cost 2 while 16 scalar
> stores take cost 16; it seems wrong to cost something that is useless. Later,
> for the vector version we need 16 vector halfword extractions from these two
> halfword vectors, while in the scalar version the values are already in GPRs,
> so the vector version looks inefficient.
>   2) On Power, the conversion from unsigned char to unsigned short is a nop
> conversion, but it is still counted when we compute the scalar cost, which
> adds a total of 32 to the scalar cost. Meanwhile, the conversion from unsigned
> short to signed short should be counted but it's not (need to check why).
> The nop conversion costing looks like something we can handle in function
> rs6000_adjust_vect_cost_per_stmt. I tried to use the generic function
> tree_nop_conversion_p, but it only handles same mode/precision conversions.
> I will look for something else.
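
To make the two conversions in point 2 concrete, here is a minimal scalar
sketch, assuming 8-bit pixels and 16-bit differences. The function names are
hypothetical and this is not the vectorizer's actual output. It only shows
that widening unsigned char to unsigned short, subtracting in unsigned short,
and then converting to signed short gives the same result as the subtraction
done in int.

```c
#include <stdint.h>

/* Reference form: subtract in int (after integer promotion), then narrow. */
static int16_t diff_via_int(uint8_t a, uint8_t b)
{
    return (int16_t)(a - b);
}

/* Narrowed form: widen to unsigned short, subtract modulo 2^16,
   then convert the result to signed short.  */
static int16_t diff_via_u16(uint8_t a, uint8_t b)
{
    uint16_t wa = a;                  /* unsigned char -> unsigned short */
    uint16_t wb = b;
    uint16_t d = (uint16_t)(wa - wb); /* wraps modulo 65536 when a < b */
    /* unsigned short -> signed short: the out-of-range case is
       implementation-defined in ISO C; GCC reduces it modulo 2^16.  */
    return (int16_t)d;
}
```
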