[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

Mon Nov 28 07:21:48 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 28 Nov 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
> 
> --- Comment #24 from Hongtao.liu <crazylht at gmail dot com> ---
>   _233 = {f_im_36, f_re_35, f_re_35, f_re_35};
>   _217 = {f_re_35, f_im_36, f_im_36, f_im_36};
> ...
>   vect_x_re_55.15_227 = VEC_PERM_EXPR <vect_x_im_61.14_228, vect_x_im_61.13_230, { 0, 5, 6, 7 }>;
>   vect_x_re_55.23_211 = VEC_PERM_EXPR <vect_x_im_61.13_230, vect_x_im_61.14_228, { 0, 5, 6, 7 }>;
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_re_55.15_227, _233, vect_y_re_63.9_237);
>   vect_y_re_69.25_208 = .FNMA (vect_x_re_55.23_211, _217, vect_y_re_69.17_224);
> 
> is equal to
> 
>   _233 = {f_im_36, f_im_36, f_im_36, f_im_36};
>   _217 = {f_re_35, f_re_35, f_re_35, f_re_35};
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_im_61.14_228, _233, vect_y_re_63.9_237)
>   vect_y_re_69.25_208 = .FNMA (vect_x_im_61.13_230, _217, vect_y_re_69.17_224)
> 
> A simplification in match.pd?

I guess that's possible, but the SLP vectorizer has a permute optimization
phase (and SLP discovery itself); it would be nice to see why the former
doesn't elide the permutes here.
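
For reference, the equivalence claimed in comment #24 can be checked with a
small lane-wise model. This is only a sketch, not GCC code: the helper names
(vec_perm, fnma, original, simplified) are made up for illustration, and it
assumes VEC_PERM_EXPR <a, b, sel> indexes into the concatenation of a and b
and that .FNMA (a, b, c) computes -(a * b) + c per lane. Note that in lanes
1-3 the two products are subtracted in the opposite order, so the rewrite is
exact for integers or rationals but for floating point relies on
reassociation being allowed, which should be the case with -Ofast since it
enables -ffast-math.

```python
def vec_perm(a, b, sel):
    """Model of VEC_PERM_EXPR <a, b, sel>: sel indexes into a ++ b."""
    both = list(a) + list(b)
    return [both[i] for i in sel]


def fnma(a, b, c):
    """Lane-wise model of .FNMA (a, b, c), assumed to be -(a * b) + c."""
    return [-(x * y) + z for x, y, z in zip(a, b, c)]


def original(x13, x14, y, f_re, f_im):
    """The sequence from the dump: permuted inputs, mixed constant vectors."""
    c233 = [f_im, f_re, f_re, f_re]            # _233
    c217 = [f_re, f_im, f_im, f_im]            # _217
    p227 = vec_perm(x14, x13, [0, 5, 6, 7])    # vect_x_re_55.15_227
    p211 = vec_perm(x13, x14, [0, 5, 6, 7])    # vect_x_re_55.23_211
    y224 = fnma(p227, c233, y)                 # vect_y_re_69.17_224
    return fnma(p211, c217, y224)              # vect_y_re_69.25_208


def simplified(x13, x14, y, f_re, f_im):
    """The proposed form: no permutes, uniform constant vectors."""
    c233 = [f_im] * 4
    c217 = [f_re] * 4
    y224 = fnma(x14, c233, y)
    return fnma(x13, c217, y224)
```

Calling original() and simplified() with the same integer inputs should give
identical lists; with floats the results may differ in the last bit in lanes
1-3 because of the different evaluation order.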