Bug 80846 - auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2017-05-21 03:02 UTC by Peter Cordes
Modified: 2018-01-16 20:47 UTC (History)
CC List: 6 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-05-24 00:00:00


Attachments
WIP patch (1.89 KB, patch)
2017-05-26 09:11 UTC, Richard Biener
adjusted tree-vect-loop.c hunk (1.63 KB, text/plain)
2017-05-26 09:56 UTC, Richard Biener
gcc8-pr80846.patch (2.56 KB, patch)
2017-07-19 11:57 UTC, Jakub Jelinek
i386-pc-solaris2.11 -m32 assembler output (385 bytes, text/plain)
2018-01-14 19:39 UTC, Rainer Orth

Description Peter Cordes 2017-05-21 03:02:57 UTC
gcc's tune=generic strategy for horizontal sums at the end of an AVX2 auto-vectorized  reduction is sub-optimal even for Intel CPUs, and horrible for CPUs that split 256b ops into 2 uops (i.e. AMD including Ryzen).

The first step should always be vextracti/f128 to reduce down to xmm vectors.  It has low latency and good throughput on all CPUs.  Keep narrowing in half until you're down to one element.  See also http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026

The one exception is a horizontal sum of 8-bit elements without overflow, where you should use VPSADBW ymm against a zeroed register to do a horizontal add without overflow, and then extract and hsum the resulting four 64-bit values.  (For signed 8-bit, you can range-shift to unsigned and then correct the scalar hsum result.)
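For reference, here is roughly what that VPSADBW sequence looks like as C intrinsics; the helper name hsum_epu8_avx2 is made up for illustration and is not part of the report:

#include <immintrin.h>
#include <stdint.h>

/* Horizontal sum of 32 unsigned bytes without overflow, narrowing to 128b first. */
static inline uint32_t hsum_epu8_avx2(__m256i v)
{
    __m256i sad  = _mm256_sad_epu8(v, _mm256_setzero_si256()); /* vpsadbw vs. zero: four 64-bit partial sums */
    __m128i lo   = _mm256_castsi256_si128(sad);
    __m128i hi   = _mm256_extracti128_si256(sad, 1);           /* vextracti128 */
    __m128i sum2 = _mm_add_epi64(lo, hi);                      /* two 64-bit sums */
    __m128i sum1 = _mm_add_epi64(sum2, _mm_unpackhi_epi64(sum2, sum2));
    return (uint32_t)_mm_cvtsi128_si32(sum1);                  /* total fits in 32 bits (max 32*255) */
}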

----

gcc tends to keep working with ymm vectors until the last step, even using VPERM2I128 (which is horrible on AMD CPUs, e.g. Ryzen: 8 uops with 3c throughput/latency vs. VEXTRACTI128 being 1 uop with 1c latency and 0.33c throughput).

// https://godbolt.org/g/PwX6yT
int sumint(const int arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  int sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum+=arr[i];
  return sum;
}

Compiled with gcc8.0.0 20170520  -mavx2 -funroll-loops -O3 -std=gnu11, we get

        vpxor   %xmm7, %xmm7, %xmm7
        leaq    4096(%rdi), %rax
.L24:
        vpaddd  (%rdi), %ymm7, %ymm0
        addq    $256, %rdi            # doing this later would let more instructions use a disp8 instead of disp32
        vpaddd  -224(%rdi), %ymm0, %ymm1
        vpaddd  -192(%rdi), %ymm1, %ymm2
        vpaddd  -160(%rdi), %ymm2, %ymm3
        vpaddd  -128(%rdi), %ymm3, %ymm4
        vpaddd  -96(%rdi), %ymm4, %ymm5
        vpaddd  -64(%rdi), %ymm5, %ymm6
        vpaddd  -32(%rdi), %ymm6, %ymm7  # unrolling without multiple accumulators loses a lot of the benefit.
        cmpq    %rdi, %rax
        jne     .L24

        # our single accumulator is currently in ymm7
        vpxor   %xmm8, %xmm8, %xmm8                 # Ryzen uops: 1  latency: x
        vperm2i128      $33, %ymm8, %ymm7, %ymm9    # 8   3
        vpaddd  %ymm7, %ymm9, %ymm10                # 2   1
        vperm2i128      $33, %ymm8, %ymm10, %ymm11  # 8   3
        vpalignr        $8, %ymm10, %ymm11, %ymm12  # 2   1
        vpaddd  %ymm12, %ymm10, %ymm13              # 2   1
        vperm2i128      $33, %ymm8, %ymm13, %ymm14  # 8   3
        vpalignr        $4, %ymm13, %ymm14, %ymm15  # 2   1
        vpaddd  %ymm15, %ymm13, %ymm0               # 2   1
        vmovd   %xmm0, %eax                         # 1   3

        vzeroupper
        ret

Using x/ymm8-15 as src1 needs a 3-byte VEX prefix instead of 2-byte, so the epilogue should reuse xmm0-6 to save code-size.  They're dead, and no x86 CPUs have write-after-write dependencies.

More importantly, the shuffle strategy is just bad.  There should be only one shuffle between each VPADDD.  I'd suggest

 vextracti128 $1, %ymm7, %xmm0
 vpaddd      %xmm7,%xmm0,%xmm0
 # Then a 128b hsum, which can use the same strategy as if we'd started with 128b
 vpunpckhqdq %xmm0,%xmm0,%xmm1  # Avoids an immediate, but without AVX use PSHUFD to copy+shuffle
 vpaddd      %xmm1,%xmm0,%xmm0
 vpshuflw    $0x4e,%xmm0,%xmm1  # or PSHUFD, or MOVSHDUP
 vpaddd      %xmm1,%xmm0,%xmm0
 vmovd       %xmm0,%eax

This is faster on Haswell by a significant margin, from avoiding the lane-crossing VPERM2I128 shuffles.  It's also smaller.

All of these instructions are 1 uop / 1c latency on Ryzen (except movd), so this is 7 uops / 9c latency.  GCC's current code is 36 uops, 17c on Ryzen.  Things are similar on Bulldozer-family, where vector ops have at least 2c latency.
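The same suggested sequence as C intrinsics, which should compile to essentially the assembly above (the helper name is hypothetical):

#include <immintrin.h>

static inline int hsum_epi32_avx2(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);                         /* vextracti128 */
    __m128i x  = _mm_add_epi32(lo, hi);                                  /* narrow to 128b first */
    x = _mm_add_epi32(x, _mm_unpackhi_epi64(x, x));                      /* vpunpckhqdq + vpaddd */
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1))); /* or pshuflw / movshdup */
    return _mm_cvtsi128_si32(x);
}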

An interesting alternative is possible for the last narrowing step with BMI2:

 vmovq  %xmm0, %rax
 rorx   $32, %rax, %rdx
 add    %edx, %eax

RORX can run on ports 0 and 6 on Intel CPUs that support it, and it's fast on Ryzen (1c latency 0.25c throughput).  If there are further vector instructions, this reduces pressure on the vector ALU ports.  The only CPU where this is really good is Excavator (bdver4?), assuming vector shuffle and VPADDD are still 2c latency each, while RORX and ADD are 1c.  (Agner Fog's spreadsheet doesn't have an Excavator tab).

I think it's a code-size win, since an integer add is only 2 bytes (or 3 with a REX prefix for r8d-r15d), which is still smaller than vector instructions.  OTOH, RORX needs an immediate and a 3-byte VEX, so it's 1 byte bigger than VPSHUFD (unless the PSHUFD uses xmm8-15 and needs a 3-byte VEX).

Without BMI2, on CPUs with efficient SHLD, you could SHLD $32, %rax, %rdx.  But that has a false dependency on %rdx, so we'd have to pick a register that's known to be ready.  The pointer register from the loop would work, since the other inputs to SHLD are dependent on loads using that pointer.  (Inserting an xor-zeroing to break the dep would lose the tiny advantage this has).  But it's slow on Bulldozer/Zen, so probably a bad idea because -mtune should avoid making code that really sucks on other CPUs when it's only a tiny gain on the tune target.
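In C, that last narrowing step is just the scalar finish below; whether the compiler picks RORX, SHLD, or a plain shift+add is up to the tuning (a sketch, with a hypothetical name):

#include <immintrin.h>
#include <stdint.h>

/* x holds two 32-bit partial sums in its low 64 bits (after the vpunpckhqdq + vpaddd step). */
static inline int hsum_tail_scalar(__m128i x)
{
    uint64_t q = (uint64_t)_mm_cvtsi128_si64(x);        /* vmovq (x86-64 only intrinsic) */
    return (int)((uint32_t)q + (uint32_t)(q >> 32));    /* high half via rorx/shld/shr, then add */
}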

---

2x PHADDD would be even smaller code-size, but that's its only advantage.  It's 3 or 4 uops on all CPUs, and has worse latency than separate shuffle and add instructions.  The same argument applies to FP horizontal sums: HADDPS isn't usually a good idea either, except with -Os.  Unfortunately gcc does use it there, even without -Os.  HADDPS ymm is 8 uops on Ryzen, with 7c latency.  Agner Fog says "mixed domain", i.e. that it wants its input in the int domain (for the shuffles) but produces a result in the FP domain, probably with a bypass delay internally.
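For reference, the compact 2x HADDPS idiom being discussed looks like this in intrinsics (a sketch; each _mm_hadd_ps decodes to 2 shuffles + an add on these CPUs, so it is small but not fast):

#include <immintrin.h>

static inline float hsum_ps_haddps(__m128 v)
{
    __m128 h = _mm_hadd_ps(v, v);    /* pairwise sums: (a0+a1, a2+a3, a0+a1, a2+a3) */
    h = _mm_hadd_ps(h, h);           /* total in every lane */
    return _mm_cvtss_f32(h);
}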

---

PSHUFLW is faster than PSHUFD on some old CPUs (like K8 and Merom/Conroe).  Irrelevant for AVX, but it's not slower or lower throughput on any recent CPUs, so there's no harm in using the same strategy for SSE2 and AVX.

Also, all current CPUs that support AVX have no penalty for using VMOVSHDUP between VPADDD instructions, so it's a handy way to save the immediate byte vs. VPSHUFD or VPSRLDQ.  It's a poor choice without AVX for -mtune=generic, because it has a penalty on Nehalem.  It may not be worth it to add a bunch of logic to decide when to save 1 byte this way, since that's all it does except on old SlowShuffle CPUs like Core2 Merom where MOVSHDUP is faster than PSHUFD.


-----


The -msse4 strategy is also sub-optimal: It can avoid the movdqa copies by using the same copy-and-shuffle instructions as AVX.

        # from gcc8 with -msse4
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm2
        psrldq  $4, %xmm2
        paddd   %xmm2, %xmm0
        movd    %xmm0, %eax
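A sketch of the copy-and-shuffle version in intrinsics (hypothetical helper); pshufd and pshuflw both copy as they shuffle, so no movdqa should be needed:

#include <immintrin.h>

static inline int hsum_epi32_sse2(__m128i x)
{
    __m128i hi64  = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));       /* pshufd: swap 64-bit halves */
    __m128i sum64 = _mm_add_epi32(x, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); /* pshuflw: swap low dwords */
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, hi32));
}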

When tuning for targets that don't have extra latency for FP shuffles between ivec instructions, instructions like movshdup and movhlps save code size (no immediate, and PS instructions have fewer prefixes) vs. pshufd.  According to Agner Fog, Ryzen runs MOVHLPS and many other FP shuffles (including SHUFPS) in the ivec domain.  (unpckh/lps/d is FP domain, though, so should be preferred for FP reductions).


------


Related: -mtune=znver1 should probably vectorize with YMM vectors, when it can be done as efficiently as with XMM vectors.  As I understand it, Ryzen's maximum sustained throughput is 6 uops per clock for double-uop instructions, but only 5 uops per clock when running single-uop instructions.  (And I think this is true even when running from the uop cache).

Also, with gcc's tendency to not unroll and use only a single vector accumulator, using a wider vector is like unrolling with 2 accumulators.  You may still bottleneck on ADDPS latency, for example, but you get twice as much work done per clock.
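As a concrete illustration of the accumulator point, plain C with two explicit accumulators (hypothetical example; for FP the compiler needs -fassociative-math to reassociate like this):

float sum_two_accumulators(const float *a, int n)   /* assumes n is even, for brevity */
{
    float s0 = 0.0f, s1 = 0.0f;                     /* two independent dependency chains */
    for (int i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;                                 /* a vector twice as wide gives this for free */
}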
Comment 1 Richard Biener 2017-05-24 10:53:07 UTC
So the vectorizer uses "whole vector shift" to do the final reduction:

  vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;

I can see that for Zen that is bad (even for avx256 in general eventually because
it crosses lanes).

That is, it was supposed to end up using pslldq, not the vperm + palign combos.

That said, the vectorizer could "easily" demote this to first add the two halves
and then continue with the reduction scheme.  The GIMPLE representation of this
is BIT_FIELD_REFs which I hope would end up being expanded in a way the x86
backend can handle (hi/lo subregs?).

I'll see to handle this better in the vectorizer.
Comment 2 Peter Cordes 2017-05-24 21:42:57 UTC
(In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ.  Byte zero is the right-most when drawn in a way that makes bit/byte shift directions all match up with the diagram.  Opposite of array-initializer order.


PSRLDQ is sub-optimal without AVX.  It needs a MOVDQA to copy-and-shuffle.  For integer shuffles, PSHUFD is what you want (and PSHUFLW for a final step with 16-bit elements).

None of the x86 copy-and-shuffle instructions can zero an element, only copy from one of the source elements.  PSHUFD can easily swap the high and low halves, but maybe some other targets can't do that as efficiently as just duplicating the high half into the low half or something.  (I only really know x86 SIMD).

Ideally we could tell the back-end that the high half values are actually don't-care and can be anything, so it can choose the best shuffles to extract the high half for each successive narrowing step.

I think clang does this: its shuffle-optimizer makes different choices in a function that returns __m128 vs. returning the low element of a vector as a scalar float.  (e.g. for hand-coded horizontal sum using Intel _mm_ intrinsics).

-------

To get truly optimal code, the backend needs more choice in what order to do the shuffles.  e.g. with SSE3, the optimal sequence for FP hsum is probably

   movshdup  %xmm0, %xmm1   # DCBA -> DDBB
   addps     %xmm1, %xmm0   # D+D C+D B+B A+B  (elements 0 and 2 are valuable)
   movhlps   %xmm0, %xmm1   # Do this second so a dependency on xmm1 can't be a problem
   addss     %xmm1, %xmm0   # addps saves 1 byte of code.

We could use MOVHLPS first to maintain the usual successive-narrowing pattern, but (to avoid a MOVAPS) only if we have a scratch register that was ready earlier in xmm0's dep chain (so introducing a dependency on it can't hurt the critical path).  It also needs to be holding FP values that won't cause a slowdown from denormals in the high two elements for the first addps.  (NaN / infinity are ok for all current SSE hardware).  Adding a number to itself is safe enough, so a shuffle that duplicates the high-half values is good.
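The same idea as intrinsics (a sketch; hsum_ps_sse3 is a made-up name):

#include <immintrin.h>

static inline float hsum_ps_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);          /* movshdup: DCBA -> DDBB */
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);          /* movhlps second, reusing shuf as the already-ready register */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}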

However, when auto-vectorizing an FP reduction with -fassociative-math but without the rest of -ffast-math, I guess we need to avoid spurious exceptions from values in elements that are otherwise don't-care.  Swapping high/low halves is always safe, e.g. using movaps + shufps for both steps:

  DCBA -> BADC and get D+B C+A B+D A+C.
  WXYZ -> XWZY and get D+B+C+A repeated four times

With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still makes it 1 byte larger than vmovhlps or vunpckhpd.
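A sketch of that exception-safe half-swapping pattern in intrinsics (hypothetical helper); shufps with the same register as both sources serves as the copy+shuffle:

#include <immintrin.h>

static inline float hsum_ps_safe(__m128 v)
{
    __m128 t = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2))); /* swap 64-bit halves */
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 3, 0, 1)));        /* swap within pairs */
    return _mm_cvtss_f32(t);                     /* every lane now holds the full sum */
}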

x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem than for integer code where you can always just use PSHUFD unless tuning for SlowShuffle CPUs like first-gen Core2, or K8.

With AVX, VPERMILPS with an immediate operand is pointless, except I guess with a memory source.  It always needs a 3-byte VEX, but VSHUFPS can use a 2-byte VEX prefix and do the same copy+in-lane-shuffle just as fast on all CPUs (using the same register as both sources), except KNL where single-source shuffles are faster.  Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD as a copy+shuffle.

------


> demote this to first add the two halves and then continue with the reduction scheme.

Sounds good.

With x86 AVX512, it takes two successive narrowing steps to get down to 128bit vectors.  Narrowing to 256b allows shorter instructions (VEX instead of EVEX).

Even with 512b or 256b execution units, narrowing to 128b is about as efficient as possible.  Doing the lower-latency in-lane shuffles first would let more instructions retire earlier, but only by a couple cycles.  I don't think it makes any sense to special-case for AVX512 and do in-lane 256b ops ending with vextractf128, especially since gcc's current design does most of the reduction strategy in target-independent code.

IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops.  It seems unlikely since 512b lane-crossing shuffles would have to decode to *many* uops, especially stuff like VPERMI2PS (select elements from two 512b sources).  Still, AMD seems to like the half-width idea, so I wouldn't be surprised to see an AVX512 CPU with 256b execution units.  Even 128b is plausible (especially for low-power / small-core CPUs like Jaguar or Silvermont).
Comment 3 Richard Biener 2017-05-26 08:33:55 UTC
Hmm, for some reason we end up spilling:

.L2:
        vmovdqa -48(%rbp), %ymm3
        addq    $32, %rdi
        vpaddd  -32(%rdi), %ymm3, %ymm2
        cmpq    %rdi, %rax
        vmovdqa %ymm2, -48(%rbp)
        jne     .L2
        vmovdqa -32(%rbp), %xmm5
        vpaddd  -48(%rbp), %xmm5, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax

when expanding the epilogue

  _6 = BIT_FIELD_REF <vect_sum_11.6_14, 128, 0>;
  _5 = BIT_FIELD_REF <vect_sum_11.6_14, 128, 128>;
  _29 = _5 + _6;
  vect_sum_11.8_22 = VEC_PERM_EXPR <_29, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
  vect_sum_11.8_21 = vect_sum_11.8_22 + _29;
  vect_sum_11.8_20 = VEC_PERM_EXPR <vect_sum_11.8_21, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
  vect_sum_11.8_19 = vect_sum_11.8_20 + vect_sum_11.8_21;
  stmp_sum_11.7_18 = BIT_FIELD_REF <vect_sum_11.8_19, 32, 0>;
  return stmp_sum_11.7_18;

as

;; _29 = _5 + _6;

(insn 17 16 18 (set (reg:OI 101)
        (subreg:OI (reg:V8SI 90 [ vect_sum_11.6 ]) 0)) -1
     (nil))

(insn 18 17 19 (set (reg:OI 102)
        (subreg:OI (reg:V8SI 90 [ vect_sum_11.6 ]) 0)) -1
     (nil))

(insn 19 18 0 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (subreg:V4SI (reg:OI 101) 16)
            (subreg:V4SI (reg:OI 102) 0))) -1
     (nil))

before RA:

(insn 19 16 20 4 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)
            (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
     (expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
        (nil)))

so the issue is likely that we do not expose the splitting in separate instructions but we assume LRA can deal with reloading the above w/o stack?

         Choosing alt 1 in insn 19:  (0) v  (1) v  (2) vm {*addv4si3}
            alt=0: Bad operand -- refuse
            alt=1: Bad operand -- refuse
          alt=2,overall=0,losers=0,rld_nregs=0

not sure if it would help not to combine the subregs into the plus in the
first place or some reload hook could help here?

With the above issue the patch I have is likely not going to help ;)
Comment 4 Richard Biener 2017-05-26 08:50:08 UTC
(define_expand "<plusminus_insn><mode>3"
  [(set (match_operand:VI_AVX2 0 "register_operand")
        (plusminus:VI_AVX2
          (match_operand:VI_AVX2 1 "vector_operand")
          (match_operand:VI_AVX2 2 "vector_operand")))]
  "TARGET_SSE2"
  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")

so maybe things can be fixed up in ix86_fixup_binary_operands which doesn't
seem to consider subregs in any way.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 248482)
+++ gcc/config/i386/i386.c      (working copy)
@@ -21270,6 +21270,11 @@ ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);
 
+  if (SUBREG_P (src1) && SUBREG_BYTE (src1) != 0)
+    src1 = force_reg (mode, src1);
+  if (SUBREG_P (src2) && SUBREG_BYTE (src2) != 0)
+    src1 = force_reg (mode, src2);
+
   /* Improve address combine.  */
   if (code == PLUS
       && GET_MODE_CLASS (mode) == MODE_INT

doesn't help though.  pre-LRA:

(insn 19 16 20 4 (set (reg:V4SI 103)
        (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)) 1222 {movv4si_internal}
     (nil))
(insn 20 19 21 4 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (reg:V4SI 103)
            (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
     (expr_list:REG_DEAD (reg:V4SI 103)
        (expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
            (nil))))

of course LRA not splitting life ranges when spilling (and thus forcing
to spill inside the loop) doesn't help either.  But we really don't want
to spill...

         Choosing alt 2 in insn 19:  (0) v  (1) vm {movv4si_internal}
            2 Non pseudo reload: reject++
          alt=1,overall=1,losers=0,rld_nregs=0
         Choosing alt 1 in insn 20:  (0) v  (1) v  (2) vm {*addv4si3}
          alt=1,overall=0,losers=0,rld_nregs=0

         Choosing alt 2 in insn 19:  (0) v  (1) vm {movv4si_internal}
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=0: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=1: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            Cycle danger: overall += LRA_MAX_REJECT

         Choosing alt 1 in insn 20:  (0) v  (1) v  (2) vm {*addv4si3}
            alt=0: Bad operand -- refuse
            alt=1: Bad operand -- refuse
          alt=2,overall=0,losers=0,rld_nregs=0

so we don't seem to handle insn 19 well (why's that movv4si_internal rather
than some pextr?)
Comment 5 Richard Biener 2017-05-26 09:11:52 UTC
Created attachment 41421 [details]
WIP patch
Comment 6 Richard Biener 2017-05-26 09:13:24 UTC
Similar with AVX512F I get

.L2:
        vmovdqa64       -112(%rbp), %zmm3
        addq    $64, %rdi
        vpaddd  -64(%rdi), %zmm3, %zmm2
        cmpq    %rdi, %rax
        vmovdqa64       %zmm2, -112(%rbp)
        jne     .L2
        vmovdqa -80(%rbp), %ymm0
        vpaddd  -112(%rbp), %ymm0, %ymm5
        vmovdqa %ymm5, -112(%rbp)
        vmovdqa -96(%rbp), %xmm0
        vpaddd  -112(%rbp), %xmm0, %xmm0
...
Comment 7 Richard Biener 2017-05-26 09:54:47 UTC
Note that, similar to the vec_init optab not allowing construction of larger vectors from smaller ones, vec_extract doesn't allow extracting smaller vectors from larger ones.  So I might be forced to go V8SI -> V2TI -> TI -> V4SI to make the RTL expander happy.

In that case for AVX512 we'll expose 256-bit integers into the IL, which means VRP will barf immediately, and we may do so anyway if optimization somehow exposes a constant that won't fit in widest_int...  Not sure if there's even V2OImode supported.

For AVX2 we then expand

  _6 = VIEW_CONVERT_EXPR<vector(2) uint128_t>(vect_sum_11.6_14);
  _5 = BIT_FIELD_REF <_6, 128, 0>;
  _29 = VIEW_CONVERT_EXPR<vector(4) int>(_5);
  _22 = BIT_FIELD_REF <_6, 128, 128>;
  _21 = VIEW_CONVERT_EXPR<vector(4) int>(_22);
  _20 = _21 + _29;

to

;; _20 = _21 + _29;

(insn 18 17 19 (set (reg:OI 104)
        (subreg:OI (reg:V2TI 89 [ _6 ]) 0)) -1
     (nil))

(insn 19 18 20 (set (reg:TI 105)
        (subreg:TI (reg:OI 104) 16)) -1
     (nil))

(insn 20 19 21 (set (reg:OI 106)
        (subreg:OI (reg:V2TI 89 [ _6 ]) 0)) -1
     (nil))

(insn 21 20 22 (set (reg:TI 107)
        (subreg:TI (reg:OI 106) 0)) -1
     (nil))

(insn 22 21 0 (set (reg:V4SI 95 [ _20 ])
        (plus:V4SI (subreg:V4SI (reg:TI 105) 0)
            (subreg:V4SI (reg:TI 107) 0))) -1
     (nil))

which doesn't seem any better ... (still using subregs rather than going
through vec_extract).

(define_expand "vec_extract<mode>"
  [(match_operand:<ssescalarmode> 0 "register_operand")
   (match_operand:VEC_EXTRACT_MODE 1 "register_operand")
   (match_operand 2 "const_int_operand")]
  "TARGET_SSE"
{
  ix86_expand_vector_extract (false, operands[0], operands[1],
                              INTVAL (operands[2]));
  DONE;
})

ssescalarmode doesn't include TImode.

Note there's the corresponding bug that says x86 doesn't implement
{ TImode, TImode } vec_init either.
Comment 8 Richard Biener 2017-05-26 09:56:16 UTC
Created attachment 41422 [details]
adjusted tree-vect-loop.c hunk
Comment 9 Jakub Jelinek 2017-07-19 11:57:52 UTC
Created attachment 41789 [details]
gcc8-pr80846.patch

Untested i386 backend V2TImode and V4TImode vec_extract/vec_init and move improvements.
Comment 10 Jakub Jelinek 2017-07-20 16:36:51 UTC
Author: jakub
Date: Thu Jul 20 16:36:18 2017
New Revision: 250397

URL: https://gcc.gnu.org/viewcvs?rev=250397&root=gcc&view=rev
Log:
	PR target/80846
	* config/i386/i386.c (ix86_expand_vector_init_general): Handle
	V2TImode and V4TImode.
	(ix86_expand_vector_extract): Likewise.
	* config/i386/sse.md (VMOVE): Enable V4TImode even for just
	TARGET_AVX512F, instead of only for TARGET_AVX512BW.
	(ssescalarmode): Handle V4TImode and V2TImode.
	(VEC_EXTRACT_MODE): Add V4TImode and V2TImode.
	(*vec_extractv2ti, *vec_extractv4ti): New insns.
	(VEXTRACTI128_MODE): New mode iterator.
	(splitter for *vec_extractv?ti first element): New.
	(VEC_INIT_MODE): New mode iterator.
	(vec_init<mode>): Consolidate 3 expanders into one using
	VEC_INIT_MODE mode iterator.

	* gcc.target/i386/avx-pr80846.c: New test.
	* gcc.target/i386/avx2-pr80846.c: New test.
	* gcc.target/i386/avx512f-pr80846.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx-pr80846.c
    trunk/gcc/testsuite/gcc.target/i386/avx2-pr80846.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr80846.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Comment 11 Richard Biener 2017-07-26 14:09:31 UTC
So after Jakub's update the vectorizer patch yields

sumint:
.LFB0:
        .cfi_startproc
        vpxor   %xmm0, %xmm0, %xmm0
        leaq    4096(%rdi), %rax
        .p2align 4,,10
        .p2align 3
.L2:
        vpaddd  (%rdi), %ymm0, %ymm0
        addq    $32, %rdi
        cmpq    %rdi, %rax
        jne     .L2
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm0, %xmm1, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        vzeroupper
        ret

that's not using the unpacking strategy (sum adjacent elements) but still the
vector shift approach (add upper/lower halves).  That's sth that can be
changed independently.

Waiting for final vec_extract/init2 optab settling.
Comment 12 Jakub Jelinek 2017-08-01 08:26:46 UTC
Author: jakub
Date: Tue Aug  1 08:26:14 2017
New Revision: 250759

URL: https://gcc.gnu.org/viewcvs?rev=250759&root=gcc&view=rev
Log:
	PR target/80846
	* optabs.def (vec_extract_optab, vec_init_optab): Change from
	a direct optab to conversion optab.
	* optabs.c (expand_vector_broadcast): Use convert_optab_handler
	with GET_MODE_INNER as last argument instead of optab_handler.
	* expmed.c (extract_bit_field_1): Likewise.  Use vector from
	vector extraction if possible and optab is available.
	* expr.c (store_constructor): Use convert_optab_handler instead
	of optab_handler.  Use vector initialization from smaller
	vectors if possible and optab is available.
	* tree-vect-stmts.c (vectorizable_load): Likewise.
	* doc/md.texi (vec_extract, vec_init): Document that the optabs
	now have two modes.
	* config/i386/i386.c (ix86_expand_vector_init): Handle expansion
	of vec_init from half-sized vectors with the same element mode.
	* config/i386/sse.md (ssehalfvecmode): Add V4TI case.
	(ssehalfvecmodelower, ssescalarmodelower): New mode attributes.
	(reduc_plus_scal_v8df, reduc_plus_scal_v4df, reduc_plus_scal_v2df,
	reduc_plus_scal_v16sf, reduc_plus_scal_v8sf, reduc_plus_scal_v4sf,
	reduc_<code>_scal_<mode>, reduc_umin_scal_v8hi): Add element mode
	after mode in gen_vec_extract* calls.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><ssescalarmodelower>): ... this.
	(vec_extract<mode><ssehalfvecmodelower>): New expander.
	(rotl<mode>3, rotr<mode>3, <shift_insn><mode>3, ashrv2di3): Add
	element mode after mode in gen_vec_init* calls.
	(VEC_INIT_HALF_MODE): New mode iterator.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><ssescalarmodelower>): ... this.
	(vec_init<mode><ssehalfvecmodelower>): New expander.
	* config/i386/mmx.md (vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	(vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2si): Renamed to ...
	(vec_extractv2sisi): ... this.
	(vec_initv2si): Renamed to ...
	(vec_initv2sisi): ... this.
	(vec_extractv4hi): Renamed to ...
	(vec_extractv4hihi): ... this.
	(vec_initv4hi): Renamed to ...
	(vec_initv4hihi): ... this.
	(vec_extractv8qi): Renamed to ...
	(vec_extractv8qiqi): ... this.
	(vec_initv8qi): Renamed to ...
	(vec_initv8qiqi): ... this.
	* config/rs6000/vector.md (VEC_base_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><VEC_base_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><VEC_base_l>): ... this.
	* config/rs6000/paired.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	* config/rs6000/altivec.md (splitter, altivec_copysign_v4sf3,
	vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi,
	vec_unpacku_lo_v8hi, mulv16qi3, altivec_vreve<mode>2): Add
	element mode after mode in gen_vec_init* calls.
	* config/aarch64/aarch64-simd.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><Vel>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><Vel>): ... this.
	* config/aarch64/iterators.md (Vel): New mode attribute.
	* config/s390/s390.c (s390_expand_vec_strlen, s390_expand_vec_movstr):
	Add element mode after mode in gen_vec_extract* calls.
	* config/s390/vector.md (non_vec_l): New mode attribute.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><non_vec_l>): ... this.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><non_vec_l>): ... this.
	* config/s390/s390-builtins.def (s390_vlgvb, s390_vlgvh, s390_vlgvf,
	s390_vlgvf_flt, s390_vlgvg, s390_vlgvg_dbl): Add element mode after
	vec_extract mode.
	* config/arm/iterators.md (V_elem_l): New mode attribute.
	* config/arm/neon.md (vec_extract<mode>): Renamed to ...
	(vec_extract<mode><V_elem_l>): ... this.
	(vec_extractv2di): Renamed to ...
	(vec_extractv2didi): ... this.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><V_elem_l>): ... this.
	(reduc_plus_scal_<mode>, reduc_plus_scal_v2di, reduc_smin_scal_<mode>,
	reduc_smax_scal_<mode>, reduc_umin_scal_<mode>,
	reduc_umax_scal_<mode>, neon_vget_lane<mode>, neon_vget_laneu<mode>):
	Add element mode after gen_vec_extract* calls.
	* config/mips/mips-msa.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><unitmode>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><unitmode>): ... this.
	* config/mips/loongson.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><unitmode>): ... this.
	* config/mips/mips-ps-3d.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	(reduc_plus_scal_v2sf, reduc_smin_scal_v2sf, reduc_smax_scal_v2sf):
	Add element mode after gen_vec_extract* calls.
	* config/mips/mips.md (unitmode): New mode iterator.
	* config/spu/spu.c (spu_expand_prologue, spu_allocate_stack,
	spu_builtin_extract): Add element mode after gen_vec_extract* calls.
	* config/spu/spu.md (inner_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><inner_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><inner_l>): ... this.
	* config/sparc/sparc.md (veltmode): New mode iterator.
	(vec_init<VMALL:mode>): Renamed to ...
	(vec_init<VMALL:mode><VMALL:veltmode>): ... this.
	* config/ia64/vect.md (vec_initv2si): Renamed to ...
	(vec_initv2sisi): ... this.
	(vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	* config/powerpcspe/vector.md (VEC_base_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><VEC_base_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><VEC_base_l>): ... this.
	* config/powerpcspe/paired.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	* config/powerpcspe/altivec.md (splitter, altivec_copysign_v4sf3,
	vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi,
	vec_unpacku_lo_v8hi, mulv16qi3): Add element mode after mode in
	gen_vec_init* calls.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64-simd.md
    trunk/gcc/config/aarch64/iterators.md
    trunk/gcc/config/arm/iterators.md
    trunk/gcc/config/arm/neon.md
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/config/i386/sse.md
    trunk/gcc/config/ia64/vect.md
    trunk/gcc/config/mips/loongson.md
    trunk/gcc/config/mips/mips-msa.md
    trunk/gcc/config/mips/mips-ps-3d.md
    trunk/gcc/config/mips/mips.md
    trunk/gcc/config/powerpcspe/altivec.md
    trunk/gcc/config/powerpcspe/paired.md
    trunk/gcc/config/powerpcspe/vector.md
    trunk/gcc/config/rs6000/altivec.md
    trunk/gcc/config/rs6000/paired.md
    trunk/gcc/config/rs6000/vector.md
    trunk/gcc/config/s390/s390-builtins.def
    trunk/gcc/config/s390/s390.c
    trunk/gcc/config/s390/vector.md
    trunk/gcc/config/sparc/sparc.md
    trunk/gcc/config/spu/spu.c
    trunk/gcc/config/spu/spu.md
    trunk/gcc/doc/md.texi
    trunk/gcc/expmed.c
    trunk/gcc/expr.c
    trunk/gcc/optabs.c
    trunk/gcc/optabs.def
    trunk/gcc/tree-vect-stmts.c
Comment 13 Jakub Jelinek 2017-08-01 16:13:04 UTC
Author: jakub
Date: Tue Aug  1 16:12:31 2017
New Revision: 250784

URL: https://gcc.gnu.org/viewcvs?rev=250784&root=gcc&view=rev
Log:
	PR target/80846
	* config/rs6000/vsx.md (vextract_fp_from_shorth,
	vextract_fp_from_shortl): Add element mode after mode in gen_vec_init*
	calls.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/vsx.md
Comment 14 Jakub Jelinek 2017-09-07 11:53:47 UTC
(In reply to Richard Biener from comment #11)
> that's not using the unpacking strategy (sum adjacent elements) but still the
> vector shift approach (add upper/lower halves).  That's sth that can be
> changed independently.
> 
> Waiting for final vec_extract/init2 optab settling.

That should be settled now.

BTW, for reductions in PR80324 I've added for avx512fintrin.h __MM512_REDUCE_OP which for reductions from 512-bit vectors uses smaller and smaller vectors, perhaps that is something we should use for the reductions emitted by the vectorizer too (perhaps through a target hook that would emit gimple for the reduction)?
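The successive-narrowing idea for a 512-bit integer sum might look roughly like this in intrinsics (a sketch only, not the actual __MM512_REDUCE_OP expansion):

#include <immintrin.h>

static inline int hsum_epi32_avx512(__m512i v)
{
    __m256i s256 = _mm256_add_epi32(_mm512_castsi512_si256(v),
                                    _mm512_extracti64x4_epi64(v, 1));   /* 512b -> 256b */
    __m128i x = _mm_add_epi32(_mm256_castsi256_si128(s256),
                              _mm256_extracti128_si256(s256, 1));       /* 256b -> 128b */
    x = _mm_add_epi32(x, _mm_unpackhi_epi64(x, x));
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(x);
}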
Comment 15 rguenther@suse.de 2017-09-07 14:38:59 UTC
On September 7, 2017 1:53:47 PM GMT+02:00, "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
>
>--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
>(In reply to Richard Biener from comment #11)
>> that's not using the unpacking strategy (sum adjacent elements) but still the
>> vector shift approach (add upper/lower halves).  That's sth that can be
>> changed independently.
>> 
>> Waiting for final vec_extract/init2 optab settling.
>
>That should be settled now.
>
>BTW, for reductions in PR80324 I've added for avx512fintrin.h __MM512_REDUCE_OP
>which for reductions from 512-bit vectors uses smaller and smaller vectors,
>perhaps that is something we should use for the reductions emitted by the
>vectorizer too (perhaps through a target hook that would emit gimple for the
>reduction)?

Yeah, I have a patch that does this.  The question is how to query the target whether the vector sizes share the same register set; we wouldn't want to go down to MMX register size, for example.

Doing this would also allow executing the adds for the 512-bit to 128-bit reduction in parallel.
Comment 16 Jakub Jelinek 2017-09-07 15:34:54 UTC
(In reply to rguenther@suse.de from comment #15)
> Yeah, I have a patch that does this. The question is how to query the target
> if the vector sizes share the same register set. Like we wouldn't want to go
> to mmx register size.
> 
> Doing this would also allow to execute the adds for 512 to 128 bit reduction
> in parallel.

I wonder if we just shouldn't have a target hook that does all that (emits the best reduction sequence given original vector mode and operation), which could return NULL/false or something to be expanded by the generic code.  The middle-end doesn't have information about costs of the various permutations, preferences of vector types etc.
Comment 17 Aldy Hernandez 2017-09-13 16:11:19 UTC
Author: aldyh
Date: Wed Sep 13 16:10:45 2017
New Revision: 252207

URL: https://gcc.gnu.org/viewcvs?rev=252207&root=gcc&view=rev
Log:
	PR target/80846
	* optabs.def (vec_extract_optab, vec_init_optab): Change from
	a direct optab to conversion optab.
	* optabs.c (expand_vector_broadcast): Use convert_optab_handler
	with GET_MODE_INNER as last argument instead of optab_handler.
	* expmed.c (extract_bit_field_1): Likewise.  Use vector from
	vector extraction if possible and optab is available.
	* expr.c (store_constructor): Use convert_optab_handler instead
	of optab_handler.  Use vector initialization from smaller
	vectors if possible and optab is available.
	* tree-vect-stmts.c (vectorizable_load): Likewise.
	* doc/md.texi (vec_extract, vec_init): Document that the optabs
	now have two modes.
	* config/i386/i386.c (ix86_expand_vector_init): Handle expansion
	of vec_init from half-sized vectors with the same element mode.
	* config/i386/sse.md (ssehalfvecmode): Add V4TI case.
	(ssehalfvecmodelower, ssescalarmodelower): New mode attributes.
	(reduc_plus_scal_v8df, reduc_plus_scal_v4df, reduc_plus_scal_v2df,
	reduc_plus_scal_v16sf, reduc_plus_scal_v8sf, reduc_plus_scal_v4sf,
	reduc_<code>_scal_<mode>, reduc_umin_scal_v8hi): Add element mode
	after mode in gen_vec_extract* calls.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><ssescalarmodelower>): ... this.
	(vec_extract<mode><ssehalfvecmodelower>): New expander.
	(rotl<mode>3, rotr<mode>3, <shift_insn><mode>3, ashrv2di3): Add
	element mode after mode in gen_vec_init* calls.
	(VEC_INIT_HALF_MODE): New mode iterator.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><ssescalarmodelower>): ... this.
	(vec_init<mode><ssehalfvecmodelower>): New expander.
	* config/i386/mmx.md (vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	(vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2si): Renamed to ...
	(vec_extractv2sisi): ... this.
	(vec_initv2si): Renamed to ...
	(vec_initv2sisi): ... this.
	(vec_extractv4hi): Renamed to ...
	(vec_extractv4hihi): ... this.
	(vec_initv4hi): Renamed to ...
	(vec_initv4hihi): ... this.
	(vec_extractv8qi): Renamed to ...
	(vec_extractv8qiqi): ... this.
	(vec_initv8qi): Renamed to ...
	(vec_initv8qiqi): ... this.
	* config/rs6000/vector.md (VEC_base_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><VEC_base_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><VEC_base_l>): ... this.
	* config/rs6000/paired.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	* config/rs6000/altivec.md (splitter, altivec_copysign_v4sf3,
	vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi,
	vec_unpacku_lo_v8hi, mulv16qi3, altivec_vreve<mode>2): Add
	element mode after mode in gen_vec_init* calls.
	* config/aarch64/aarch64-simd.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><Vel>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><Vel>): ... this.
	* config/aarch64/iterators.md (Vel): New mode attribute.
	* config/s390/s390.c (s390_expand_vec_strlen, s390_expand_vec_movstr):
	Add element mode after mode in gen_vec_extract* calls.
	* config/s390/vector.md (non_vec_l): New mode attribute.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><non_vec_l>): ... this.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><non_vec_l>): ... this.
	* config/s390/s390-builtins.def (s390_vlgvb, s390_vlgvh, s390_vlgvf,
	s390_vlgvf_flt, s390_vlgvg, s390_vlgvg_dbl): Add element mode after
	vec_extract mode.
	* config/arm/iterators.md (V_elem_l): New mode attribute.
	* config/arm/neon.md (vec_extract<mode>): Renamed to ...
	(vec_extract<mode><V_elem_l>): ... this.
	(vec_extractv2di): Renamed to ...
	(vec_extractv2didi): ... this.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><V_elem_l>): ... this.
	(reduc_plus_scal_<mode>, reduc_plus_scal_v2di, reduc_smin_scal_<mode>,
	reduc_smax_scal_<mode>, reduc_umin_scal_<mode>,
	reduc_umax_scal_<mode>, neon_vget_lane<mode>, neon_vget_laneu<mode>):
	Add element mode after gen_vec_extract* calls.
	* config/mips/mips-msa.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><unitmode>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><unitmode>): ... this.
	* config/mips/loongson.md (vec_init<mode>): Renamed to ...
	(vec_init<mode><unitmode>): ... this.
	* config/mips/mips-ps-3d.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	(reduc_plus_scal_v2sf, reduc_smin_scal_v2sf, reduc_smax_scal_v2sf):
	Add element mode after gen_vec_extract* calls.
	* config/mips/mips.md (unitmode): New mode iterator.
	* config/spu/spu.c (spu_expand_prologue, spu_allocate_stack,
	spu_builtin_extract): Add element mode after gen_vec_extract* calls.
	* config/spu/spu.md (inner_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><inner_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><inner_l>): ... this.
	* config/sparc/sparc.md (veltmode): New mode iterator.
	(vec_init<VMALL:mode>): Renamed to ...
	(vec_init<VMALL:mode><VMALL:veltmode>): ... this.
	* config/ia64/vect.md (vec_initv2si): Renamed to ...
	(vec_initv2sisi): ... this.
	(vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	(vec_extractv2sf): Renamed to ...
	(vec_extractv2sfsf): ... this.
	* config/powerpcspe/vector.md (VEC_base_l): New mode attribute.
	(vec_init<mode>): Renamed to ...
	(vec_init<mode><VEC_base_l>): ... this.
	(vec_extract<mode>): Renamed to ...
	(vec_extract<mode><VEC_base_l>): ... this.
	* config/powerpcspe/paired.md (vec_initv2sf): Renamed to ...
	(vec_initv2sfsf): ... this.
	* config/powerpcspe/altivec.md (splitter, altivec_copysign_v4sf3,
	vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi,
	vec_unpacku_lo_v8hi, mulv16qi3): Add element mode after mode in
	gen_vec_init* calls.

Modified:
    branches/range-gen2/gcc/ChangeLog
    branches/range-gen2/gcc/config/aarch64/aarch64-simd.md
    branches/range-gen2/gcc/config/aarch64/iterators.md
    branches/range-gen2/gcc/config/arm/iterators.md
    branches/range-gen2/gcc/config/arm/neon.md
    branches/range-gen2/gcc/config/i386/i386.c
    branches/range-gen2/gcc/config/i386/mmx.md
    branches/range-gen2/gcc/config/i386/sse.md
    branches/range-gen2/gcc/config/ia64/vect.md
    branches/range-gen2/gcc/config/mips/loongson.md
    branches/range-gen2/gcc/config/mips/mips-msa.md
    branches/range-gen2/gcc/config/mips/mips-ps-3d.md
    branches/range-gen2/gcc/config/mips/mips.md
    branches/range-gen2/gcc/config/powerpcspe/altivec.md
    branches/range-gen2/gcc/config/powerpcspe/paired.md
    branches/range-gen2/gcc/config/powerpcspe/vector.md
    branches/range-gen2/gcc/config/rs6000/altivec.md
    branches/range-gen2/gcc/config/rs6000/paired.md
    branches/range-gen2/gcc/config/rs6000/vector.md
    branches/range-gen2/gcc/config/s390/s390-builtins.def
    branches/range-gen2/gcc/config/s390/s390.c
    branches/range-gen2/gcc/config/s390/vector.md
    branches/range-gen2/gcc/config/sparc/sparc.md
    branches/range-gen2/gcc/config/spu/spu.c
    branches/range-gen2/gcc/config/spu/spu.md
    branches/range-gen2/gcc/doc/md.texi
    branches/range-gen2/gcc/expmed.c
    branches/range-gen2/gcc/expr.c
    branches/range-gen2/gcc/optabs.c
    branches/range-gen2/gcc/optabs.def
    branches/range-gen2/gcc/tree-vect-stmts.c
Comment 18 Aldy Hernandez 2017-09-13 16:15:58 UTC
Author: aldyh
Date: Wed Sep 13 16:15:07 2017
New Revision: 252229

URL: https://gcc.gnu.org/viewcvs?rev=252229&root=gcc&view=rev
Log:
	PR target/80846
	* config/rs6000/vsx.md (vextract_fp_from_shorth,
	vextract_fp_from_shortl): Add element mode after mode in gen_vec_init*
	calls.

Modified:
    branches/range-gen2/gcc/ChangeLog
    branches/range-gen2/gcc/config/rs6000/vsx.md
Comment 19 Richard Biener 2018-01-12 11:43:44 UTC
Author: rguenth
Date: Fri Jan 12 11:43:13 2018
New Revision: 256576

URL: https://gcc.gnu.org/viewcvs?rev=256576&root=gcc&view=rev
Log:
2018-01-12  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/80846
	* target.def (split_reduction): New target hook.
	* targhooks.c (default_split_reduction): New function.
	* targhooks.h (default_split_reduction): Declare.
	* tree-vect-loop.c (vect_create_epilog_for_reduction): If the
	target requests first reduce vectors by combining low and high
	parts.
	* tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
	(get_vectype_for_scalar_type_and_size): Export.
	* tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.

	* doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
	* doc/tm.texi: Regenerate.

	i386/
	* config/i386/i386.c (ix86_split_reduction): Implement
	TARGET_VECTORIZE_SPLIT_REDUCTION.

	* gcc.target/i386/pr80846-1.c: New testcase.
	* gcc.target/i386/pr80846-2.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr80846-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr80846-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/doc/tm.texi
    trunk/gcc/doc/tm.texi.in
    trunk/gcc/target.def
    trunk/gcc/targhooks.c
    trunk/gcc/targhooks.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-loop.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h
Comment 20 Richard Biener 2018-01-12 11:44:07 UTC
Fixed.
Comment 21 Peter Cordes 2018-01-14 12:45:40 UTC
(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately only fixed for integer, not FP.  The OpenMP and vanilla float array sum functions from the godbolt link in the initial bug report still use 256b shuffles, including a gratuitous vperm2f128 when the upper half isn't used, so vextractf128 would have done the same job in 1 uop on Ryzen instead of 8.

Even on Intel CPUs, they're optimized for code-size, not performance (vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX only have one FP shuffle unit.  (Including Sandy/Ivybridge, which has 2 integer-128 shuffle units)

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
#  (tune=generic, and even with arch=znver1 no-prefer-avx128)
        ...
        vhaddps %ymm0, %ymm0, %ymm0
        vhaddps %ymm0, %ymm0, %ymm1
        vperm2f128      $1, %ymm1, %ymm1, %ymm0   # why not vextract?
        vaddps  %ymm1, %ymm0, %ymm0               # gratuitous 256b
        vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only in the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but it's still 2 uops for port5 vs. VSHUFPS + VADDPS.  (Costing an extra cycle of latency because with only 1 shuffle port, the 2 interleave-shuffles that feed a vertical-add uop can't run in the same cycle.)  (V)HADDPS with the same input twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel).  It's also 7c latency (vs. 3 for ADDPS).  256b VHADDPS ymm is 8 uops, one per 3 cycle throughput, and Agner Fog reports that it's "mixed domain", i.e. some kind of penalty for ivec / fp domain crossing.  I guess the shuffles it uses internally are ivec domain?

With multiple threads on the same core, or even with ILP with surrounding code, uop throughput matters as well as latency, so more uops is worse even if it didn't have latency costs.

The sequence I'd recommend (for both Intel and AMD) is:
(See also http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)


        vextractf128    $1, %ymm0, %xmm1
        vaddps          %xmm1, %xmm0, %xmm0          # narrow to 128b

        vmovshdup       %xmm0, %xmm1                 # copy high->low in each pair
        vaddps          %xmm1, %xmm0, %xmm0

        vmovhlps        %xmm0, %xmm0, %xmm1          # duplicate high 64b
        vaddps          %xmm1, %xmm0, %xmm0

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you can do a 128b hsum with 4 instructions, with no MOVAPS.

Intel: 6 uops total, 3 shuffles.  vs. 8 total, 5 shuffles

AMD Ryzen: 6 uops, 3 shuffles.  vs. 26 total uops, 20 of them shuffles.  And much worse latency, too.

Even just fixing this specific bug without fixing the rest of the sequence would help AMD *significantly*, because vextractf128 is very cheap, and vhaddps xmm is only half the uops of ymm.  (But the latency still sucks).
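For reference, the recommended sequence above as intrinsics (hsum_ps_avx is a hypothetical name):

#include <immintrin.h>

static inline float hsum_ps_avx(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* vextractf128: narrow to 128b first */
    __m128 x  = _mm_add_ps(lo, hi);
    x = _mm_add_ps(x, _mm_movehdup_ps(x));     /* vmovshdup + vaddps */
    x = _mm_add_ss(x, _mm_movehl_ps(x, x));    /* vmovhlps + vaddss */
    return _mm_cvtss_f32(x);
}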

-----

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need to report that bug separately, because it probably won't get fixed soon: if I understand correctly, there's no mechanism for the back-end to tell the auto-vectorizer what shuffles it can do efficiently!

It usually won't make too much difference, but for very small arrays (like 8 `int` elements) the hsum is a big part of the cost, although it's probably still worth auto-vectorizing *if* you can do it efficiently.
Comment 22 Peter Cordes 2018-01-14 12:47:21 UTC
Forgot the Godbolt link with updated cmdline options: https://godbolt.org/g/FCZAEj.
Comment 23 Rainer Orth 2018-01-14 19:39:03 UTC
Created attachment 43122 [details]
i386-pc-solaris2.11 -m32 assembler output
Comment 24 Rainer Orth 2018-01-14 19:39:58 UTC
The new gcc.target/i386/pr80846-1.c testcase FAILs on Solaris/x86 (32 and 64-bit):

+FAIL: gcc.target/i386/pr80846-1.c scan-assembler-times vextracti 2 (found 1 times)

Assembler output attached.

  Rainer
Comment 25 Peter Cordes 2018-01-15 03:22:14 UTC
We're getting a spill/reload inside the loop with AVX512:

    .L2:
	vmovdqa64	(%esp), %zmm3
	vpaddd	(%eax), %zmm3, %zmm2
	addl	$64, %eax
	vmovdqa64	%zmm2, (%esp)
	cmpl	%eax, %edx
	jne	.L2

Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in ZMM2 is ignored, and we get

    # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
	vmovdqa	32(%esp), %ymm5
	vpaddd	(%esp), %ymm5, %ymm0

    # braindead: vextracti128 can write a new reg instead of destroying xmm0
	vmovdqa	%xmm0, %xmm1
	vextracti128	$1, %ymm0, %xmm0
	vpaddd	%xmm0, %xmm1, %xmm0

        ... then a sane 128b hsum as expected, so at least that part went right.
Comment 26 Richard Biener 2018-01-15 10:17:51 UTC
(In reply to Peter Cordes from comment #25)
> We're getting a spill/reload inside the loop with AVX512:
> 
>     .L2:
> 	vmovdqa64	(%esp), %zmm3
> 	vpaddd	(%eax), %zmm3, %zmm2
> 	addl	$64, %eax
> 	vmovdqa64	%zmm2, (%esp)
> 	cmpl	%eax, %edx
> 	jne	.L2
> 
> Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in
> ZMM2 is ignored, and we get
> 
>     # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or
> VEXTRACTI64X4
> 	vmovdqa	32(%esp), %ymm5
> 	vpaddd	(%esp), %ymm5, %ymm0
> 
>     # braindead: vextracti128 can write a new reg instead of destroying xmm0
> 	vmovdqa	%xmm0, %xmm1
> 	vextracti128	$1, %ymm0, %xmm0
> 	vpaddd	%xmm0, %xmm1, %xmm0
> 
>         ... then a sane 128b hsum as expected, so at least that part went
> right.

I filed PR83850 for this (I noticed this before committing).  This somehow
regressed in RA or the target.
Comment 27 Richard Biener 2018-01-15 10:25:47 UTC
(In reply to Peter Cordes from comment #21)
> (In reply to Richard Biener from comment #20)
> > Fixed.
> 
> Unfortunately only fixed for integer, not FP.  The OpenMP and vanilla float
> array sum functions from the godbolt link in the initial bug report still
> use 256b shuffles, including a gratuitous vperm2f128 when the upper half
> isn't used, so vextractf128 would have done the same job in 1 uop on Ryzen
> instead of 8.
> 
> Even on Intel CPUs, they're optimized for code-size, not performance
> (vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX
> only have one FP shuffle unit.  (Including Sandy/Ivybridge, which has 2
> integer-128 shuffle units)
> 
> float sumfloat_autovec(const float arr[]) {
>   arr = __builtin_assume_aligned(arr, 64);
>   float sum=0;
>   for (int i=0 ; i<1024 ; i++)
>     sum = sum + arr[i];
>   return sum;
> }
> 
> # gcc 20180113 -mavx2 -ffast-math -O3
> #  (tune=generic, and even with arch=znver1 no-prefer-avx128)
>         ...
>         vhaddps %ymm0, %ymm0, %ymm0
>         vhaddps %ymm0, %ymm0, %ymm1
>         vperm2f128      $1, %ymm1, %ymm1, %ymm0   # why not vextract?
>         vaddps  %ymm1, %ymm0, %ymm0               # gratuitous 256b
>         vzeroupper
> 
> This bug is still present for FP code: it narrows from 256b to scalar only
> in the last step.
> 
> Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
> it's still 2 uops for port5 vs. VSHUFPS + VADDPS.  (Costing an extra cycle
> of latency because with only 1 shuffle port, the 2 interleave-shuffles that
> feed a vertical-add uop can't run in the same cycle.)  (V)HADDPS with the
> same input twice is almost never the best choice for performance.
> 
> On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel).
> It's also 7c latency (vs. 3 for ADDPS).  256b VHADDPS ymm is 8 uops, one per
> 3 cycle throughput, and Agner Fog reports that it's "mixed domain", i.e.
> some kind of penalty for ivec / fp domain crossing.  I guess the shuffles it
> uses internally are ivec domain?
> 
> With multiple threads on the same core, or even with ILP with surrounding
> code, uop throughput matters as well as latency, so more uops is worse even
> if it didn't have latency costs.
> 
> The sequence I'd recommend (for both Intel and AMD) is:
> (See also http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)
> 
> 
>         vextractf128    $1, %ymm0, %xmm1
>         vaddps          %xmm1, %xmm0, %xmm0          # narrow to 128b
> 
>         vmovshdup       %xmm0, %xmm1          # copy high->low in each pair
>         vaddps          %xmm1, %xmm0, %xmm0
> 
>         vmovhlps        %xmm0, %xmm0, %xmm1          # duplicate high 64b
>         vaddps          %xmm1, %xmm0, %xmm0
> 
> The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
> can do a 128b hsum with 4 instructions, with no MOVAPS.
> 
> Intel: 6 uops total, 3 shuffles.  vs. 8 total, 5 shuffles
> 
> AMD Ryzen: 6 uops, 3 shuffles.  vs. 26 total uops, 20 of them shuffles.  And
> much worse latency, too.
> 
> Even just fixing this specific bug without fixing the rest of the sequence
> would help AMD *significantly*, because vextractf128 is very cheap, and
> vhaddps xmm is only half the uops of ymm.  (But the latency still sucks).

Note that this is deliberately left as-is because the target advertises
(cheap) support for horizontal reduction.  The vectorizer simply generates
a single statement for the reduction epilogue:

  <bb 4> [local count: 10737418]:
  stmp_sum_11.5_20 = REDUC_PLUS (vect_sum_11.4_6); [tail call]
  return stmp_sum_11.5_20;

so either the target shouldn't tell the vectorizer it supports this or
it simply needs to expand to better code.  Which means - can you open
a separate bug for this?  The backend currently does:

(define_expand "reduc_plus_scal_v8sf"
  [(match_operand:SF 0 "register_operand")
   (match_operand:V8SF 1 "register_operand")]
  "TARGET_AVX"
{
  rtx tmp = gen_reg_rtx (V8SFmode);
  rtx tmp2 = gen_reg_rtx (V8SFmode);
  rtx vec_res = gen_reg_rtx (V8SFmode);
  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
  DONE;
})

changing it to your sequence shouldn't be too difficult.

> -----
> 
> Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
> without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need
> to report that bug separately, because it probably won't get fixed soon: if
> I understand correctly, there's no mechanism for the back-end to tell the
> auto-vectorizer what shuffles it can do efficiently!

Also, shuffles on GIMPLE do not expose the target's micro-ops, so this is a pure target issue.  Please report it separately.

> 
> It usually won't make too much difference, but for very small arrays (like 8
> `int` elements) the hsum is a big part of the cost, although it's probably
> still worth auto-vectorizing *if* you can do it efficiently.

So I'm closing this bug again.
Comment 28 Peter Cordes 2018-01-16 20:47:44 UTC
(In reply to Richard Biener from comment #27)
> Note that this is deliberately left as-is because the target advertises
> (cheap) support for horizontal reduction.  The vectorizer simply generates
> a single statement for the reduction epilogue:
>  [...]
> so either the target shouldn't tell the vectorizer it supports this or
> it simply needs to expand to better code.  Which means - can you open
> a separate bug for this?

Yes; I was incorrectly assuming the inefficient asm had the same cause as before.  I agree *this* is fixed, thanks for the explanation of how gcc was arriving at this sequence.

I'll have a look at the backend canned sequence defs and see if there are any other sub-optimal ones, or if it was only AVX.

Having canned sequences for different target instruction sets instead of leaving it to arch-independent code seems like it should be an improvement over the old design.