This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 26 May 2017 08:50:08 +0000
- Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- Auto-submitted: auto-generated
- References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(define_expand "<plusminus_insn><mode>3"
  [(set (match_operand:VI_AVX2 0 "register_operand")
	(plusminus:VI_AVX2
	  (match_operand:VI_AVX2 1 "vector_operand")
	  (match_operand:VI_AVX2 2 "vector_operand")))]
  "TARGET_SSE2"
  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
so maybe things can be fixed up in ix86_fixup_binary_operands, which doesn't
seem to consider subregs in any way.
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 248482)
+++ gcc/config/i386/i386.c (working copy)
@@ -21270,6 +21270,11 @@ ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);
 
+  if (SUBREG_P (src1) && SUBREG_BYTE (src1) != 0)
+    src1 = force_reg (mode, src1);
+  if (SUBREG_P (src2) && SUBREG_BYTE (src2) != 0)
+    src2 = force_reg (mode, src2);
+
   /* Improve address combine.  */
   if (code == PLUS
       && GET_MODE_CLASS (mode) == MODE_INT
doesn't help though. pre-LRA:
(insn 19 16 20 4 (set (reg:V4SI 103)
        (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)) 1222 {movv4si_internal}
     (nil))
(insn 20 19 21 4 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (reg:V4SI 103)
            (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
     (expr_list:REG_DEAD (reg:V4SI 103)
        (expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
            (nil))))
of course LRA not splitting live ranges when spilling (and thus forcing the
spill inside the loop) doesn't help either. But we really don't want
to spill...
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
2 Non pseudo reload: reject++
alt=1,overall=1,losers=0,rld_nregs=0
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=1,overall=0,losers=0,rld_nregs=0
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=0: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=1: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
Cycle danger: overall += LRA_MAX_REJECT
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=0: Bad operand -- refuse
alt=1: Bad operand -- refuse
alt=2,overall=0,losers=0,rld_nregs=0
so we don't seem to handle insn 19 well (why is that a movv4si_internal rather
than some pextr?)