This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 24 May 2017 10:53:07 +0000
- Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- Auto-submitted: auto-generated
- References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What            |Removed                      |Added
----------------------------------------------------------------------------
             Status        |UNCONFIRMED                  |ASSIGNED
   Last reconfirmed        |                             |2017-05-24
                 CC        |                             |uros at gcc dot gnu.org
             Blocks        |                             |53947
           Assignee        |unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed        |0                            |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer uses "whole vector shift" to do the final reduction:
  vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;
I can see that for Zen that is bad (even for avx256 in general eventually, because it crosses lanes).
That is, it was supposed to end up using pslldq, not the vperm + palign combos.
That said, the vectorizer could "easily" demote this to first add the two halves and then continue with the reduction scheme. The GIMPLE representation of this is BIT_FIELD_REFs, which I hope would end up being expanded in a way the x86 backend can handle (hi/lo subregs?).
I'll see about handling this better in the vectorizer.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations