This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/82136] x86: -mavx256-split-unaligned-load should try to fold other shuffles into the load/vinsertf128
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 12 Sep 2017 08:47:10 +0000
- Subject: [Bug target/82136] x86: -mavx256-split-unaligned-load should try to fold other shuffles into the load/vinsertf128
- Auto-submitted: auto-generated
- References: <bug-82136-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
Richard Biener <rguenth at gcc dot gnu.org> changed:
What                |Removed      |Added
----------------------------------------------------------------------------
Status              |UNCONFIRMED  |NEW
Last reconfirmed    |             |2017-09-12
CC                  |             |rguenth at gcc dot gnu.org
Ever confirmed      |0            |1
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC just applies the general interleaving strategy here, which for existing
groups can indeed be quite bad. And it gets worse because of the load/store
splitting, which isn't exposed to the vectorizer.
In the end the GIMPLE IL explains more clearly what the vectorizer tries to
do -- extract even/odd, multiply/add, and then interleave high/low:
  vect_x_13.2_26 = MEM[base: _2, offset: 0B];
  vect_x_13.3_22 = MEM[base: _2, offset: 32B];
  vect_perm_even_21 = VEC_PERM_EXPR <vect_x_13.2_26, vect_x_13.3_22, { 0, 2, 4, 6 }>;
  vect_perm_odd_20 = VEC_PERM_EXPR <vect_x_13.2_26, vect_x_13.3_22, { 1, 3, 5, 7 }>;
  vect__7.4_19 = vect_perm_odd_20 * vect_perm_even_21;
  vect__8.5_18 = vect_perm_odd_20 + vect_perm_even_21;
  vect_inter_high_34 = VEC_PERM_EXPR <vect__7.4_19, vect__8.5_18, { 0, 4, 1, 5 }>;
  vect_inter_low_29 = VEC_PERM_EXPR <vect__7.4_19, vect__8.5_18, { 2, 6, 3, 7 }>;
  MEM[base: _2, offset: 0B] = vect_inter_high_34;
  MEM[base: _2, offset: 32B] = vect_inter_low_29;
Not sure what ends up messing things up here (I guess AVX256 doesn't have
full-width extract even/odd and interleave high/low ...).
Looks like with -mprefer-avx128 we never try the larger vector size (oops?).
At least we then figure vectorization isn't profitable.
So all of this probably boils down to the cost of permutes not being modeled.