This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 24 May 2017 10:53:07 +0000
- Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- Auto-submitted: auto-generated
- References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What            |Removed                      |Added
----------------------------------------------------------------------------
             Status        |UNCONFIRMED                  |ASSIGNED
   Last reconfirmed        |                             |2017-05-24
                 CC        |                             |uros at gcc dot gnu.org
             Blocks        |                             |53947
           Assignee        |unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed        |0                            |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer uses "whole vector shift" to do the final reduction:
  vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;
I can see that for Zen that is bad (even for avx256 in general eventually, because it crosses lanes).
That is, it was supposed to end up using pslldq, not the vperm + palign combos.
That said, the vectorizer could "easily" demote this to first add the two halves and then continue with the reduction scheme. The GIMPLE representation of this is BIT_FIELD_REFs, which I hope would end up being expanded in a way the x86 backend can handle (hi/lo subregs?).
I'll see about handling this better in the vectorizer.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations