This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 26 May 2017 08:50:08 +0000
- Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- Auto-submitted: auto-generated
- References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(define_expand "<plusminus_insn><mode>3"
  [(set (match_operand:VI_AVX2 0 "register_operand")
	(plusminus:VI_AVX2
	  (match_operand:VI_AVX2 1 "vector_operand")
	  (match_operand:VI_AVX2 2 "vector_operand")))]
  "TARGET_SSE2"
  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
so maybe things can be fixed up in ix86_fixup_binary_operands, which doesn't
seem to consider subregs in any way.
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 248482)
+++ gcc/config/i386/i386.c (working copy)
@@ -21270,6 +21270,11 @@ ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);
 
+  if (SUBREG_P (src1) && SUBREG_BYTE (src1) != 0)
+    src1 = force_reg (mode, src1);
+  if (SUBREG_P (src2) && SUBREG_BYTE (src2) != 0)
+    src2 = force_reg (mode, src2);
+
   /* Improve address combine.  */
   if (code == PLUS
       && GET_MODE_CLASS (mode) == MODE_INT
doesn't help though. pre-LRA:
(insn 19 16 20 4 (set (reg:V4SI 103)
        (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)) 1222 {movv4si_internal}
     (nil))
(insn 20 19 21 4 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (reg:V4SI 103)
            (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
     (expr_list:REG_DEAD (reg:V4SI 103)
        (expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
            (nil))))
of course LRA not splitting live ranges when spilling (and thus forcing the
spill inside the loop) doesn't help either. But we really don't want
to spill...
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
2 Non pseudo reload: reject++
alt=1,overall=1,losers=0,rld_nregs=0
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=1,overall=0,losers=0,rld_nregs=0
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=0: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=1: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
Cycle danger: overall += LRA_MAX_REJECT
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=0: Bad operand -- refuse
alt=1: Bad operand -- refuse
alt=2,overall=0,losers=0,rld_nregs=0
so we don't seem to handle insn 19 well (why is that a movv4si_internal rather
than some pextr?)