typedef float real_t;

#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D], b[LEN_1D];

int main ()
{
    // loop rerolling
    // unrolled dot product
    real_t dot;
    for (int nl = 0; nl < 8*iterations; nl++) {
        dot = (real_t)0.;
        for (int i = 0; i < LEN_1D; i += 5) {
            dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1]
                      + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
                      + a[i + 4] * b[i + 4];
        }
    }
    return dot;
}

clang does:

main:                                   # @main
        .cfi_startproc
# %bb.0:
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-5, %rcx
        .p2align        4, 0x90
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovups b+20(,%rcx,4), %xmm1
        vmovss  b+36(,%rcx,4), %xmm2            # xmm2 = mem[0],zero,zero,zero
        vmulps  a+20(,%rcx,4), %xmm1, %xmm1
        vpermilpd       $1, %xmm1, %xmm3        # xmm3 = xmm1[1,0]
        vaddps  %xmm3, %xmm1, %xmm1
        vmovshdup       %xmm1, %xmm3            # xmm3 = xmm1[1,1,3,3]
        vaddss  %xmm3, %xmm1, %xmm1
        vfmadd231ss     a+36(,%rcx,4), %xmm2, %xmm1 # xmm1 = (xmm2 * mem) + xmm1
        addq    $5, %rcx
        vaddss  %xmm0, %xmm1, %xmm0
        cmpq    $31995, %rcx                    # imm = 0x7CFB
        jb      .LBB0_2
# %bb.3:                                # in Loop: Header=BB0_1 Depth=1
        incl    %eax
        cmpl    $800000, %eax                   # imm = 0xC3500
        jne     .LBB0_1
# %bb.4:
        vcvttss2si      %xmm0, %eax
        retq
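For reference, the manually unrolled s352 kernel above is semantically just a plain dot product; "re-rolling" would recover the stride-1 form below, which a vectorizer can handle with any vectorization factor. This is an editorial sketch, not part of the benchmark; the function names are made up for illustration.

```c
#include <assert.h>

typedef float real_t;

/* Re-rolled (stride-1) form of the kernel: what loop re-rolling
   would ideally recover from the unrolled source.  */
static real_t dot_rerolled (const real_t *a, const real_t *b, int n)
{
    real_t dot = (real_t)0.;
    for (int i = 0; i < n; i++)
        dot += a[i] * b[i];
    return dot;
}

/* The original five-way unrolled kernel, for comparison;
   n must be a multiple of 5 as in TSVC.  */
static real_t dot_unrolled (const real_t *a, const real_t *b, int n)
{
    real_t dot = (real_t)0.;
    for (int i = 0; i < n; i += 5)
        dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
                  + a[i + 4] * b[i + 4];
    return dot;
}
```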
With -fno-tree-reassoc we detect the reduction chain and produce

.L3:
        vmovaps b(%rax), %ymm5
        vmovaps b+32(%rax), %ymm6
        addq    $160, %rax
        vfmadd231ps     a-160(%rax), %ymm5, %ymm1
        vmovaps b-96(%rax), %ymm7
        vfmadd231ps     a-128(%rax), %ymm6, %ymm0
        vmovaps b-64(%rax), %ymm5
        vmovaps b-32(%rax), %ymm6
        vfmadd231ps     a-96(%rax), %ymm7, %ymm2
        vfmadd231ps     a-64(%rax), %ymm5, %ymm3
        vfmadd231ps     a-32(%rax), %ymm6, %ymm4
        cmpq    $128000, %rax
        jne     .L3
        vaddps  %ymm1, %ymm0, %ymm0
        vaddps  %ymm2, %ymm0, %ymm0
        vaddps  %ymm3, %ymm0, %ymm0
        vaddps  %ymm4, %ymm0, %ymm0
        vextractf128    $0x1, %ymm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm1
        vmovhlps        %xmm1, %xmm1, %xmm0
        vaddps  %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm0, %xmm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        decl    %edx
        jne     .L2

We're not re-rolling and thus are forced to use a VF of 4 here.  Note that
LLVM doesn't seem to vectorize the loop but instead vectorizes the basic
block, which isn't what TSVC looks for (but that would work for
non-fast-math).
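As an editorial aside, the reduction-chain shape the vectorizer exploits above can be modeled in scalar C: each fma chain in the assembly corresponds to an independent partial sum, and the partial sums are combined only in the loop epilogue. This is an illustration of the concept, not GCC's internal representation; the function name is made up.

```c
#include <assert.h>

/* Scalar model of the vectorized reduction chain: one independent
   accumulator per chain, combined after the loop (matching the
   vaddps epilogue in the assembly above).  n must be a multiple of 5. */
static float dot_chains (const float *a, const float *b, int n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f, s4 = 0.f;
    for (int i = 0; i < n; i += 5) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];
    }
    return s0 + s1 + s2 + s3 + s4;  /* cross-chain combine */
}
```

Under -ffast-math this association is legal; without it, keeping the chains in source order is what allows vectorization in the first place.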
This is also seen on zen4 comparing gcc and aocc (about a 2.3x difference).
(In reply to Jan Hubicka from comment #2)
> This is also seen on zen4 comparing gcc and aocc (about a 2.3x difference).

Disabling

@@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
             binary op are chosen wisely.  */
          int len = ops.length ();
          if (len >= 3)
            swap_ops_for_binary_stmt (ops, len - 3, stmt);

will naturally create the reduction chain (or leave it in place) given the
current rank computation.  We do have (somewhat) robust fallback from
reduction chain to reduction (via reduction path support), so I think this
change would be OK.
(In reply to Richard Biener from comment #3)
> (In reply to Jan Hubicka from comment #2)
> > This is also seen on zen4 comparing gcc and aocc (about a 2.3x
> > difference).
>
> Disabling
>
> @@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
>              binary op are chosen wisely.  */
>           int len = ops.length ();
>           if (len >= 3)
>             swap_ops_for_binary_stmt (ops, len - 3, stmt);
>
> will naturally create the reduction chain (or leave it in place) given the
> current rank computation.  We do have (somewhat) robust fallback from
> reduction chain to reduction (via reduction path support), so I think this
> change would be OK.

The code originated from r0-111616-gdf7b0cc4aae062, the reassoc rewrite by
Jeff back in 2005 for GCC 4.3 (or 4.2, I don't have that tree around anymore).
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b073f2b098ba7819450d6c14a0fb96cb1c09f242

commit r13-5122-gb073f2b098ba7819450d6c14a0fb96cb1c09f242
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jan 12 11:18:22 2023 +0100

    tree-optimization/99412 - reassoc and reduction chains

    With -ffast-math we end up associating reduction chains and break
    them - this is because of old code that tries to rectify reductions
    into a shape liked by the vectorizer.  Nowadays the rank compute
    produces correct association for reduction chains and the vectorizer
    has robust support to fall back to a regular reduction (via
    reduction path) when it turns out to be not a proper reduction
    chain.  So this patch removes the special code in reassoc, which
    makes the TSVC s352 vectorized with -Ofast (it is already without
    -ffast-math).

            PR tree-optimization/99412
            * tree-ssa-reassoc.cc (is_phi_for_stmt): Remove.
            (swap_ops_for_binary_stmt): Remove reduction handling.
            (rewrite_expr_tree_parallel): Adjust.
            (reassociate_bb): Likewise.
            * tree-parloops.cc (build_new_reduction): Handle MINUS_EXPR.

            * gcc.dg/vect/pr99412.c: New testcase.
            * gcc.dg/tree-ssa/reassoc-47.c: Adjust comment.
            * gcc.dg/tree-ssa/reassoc-48.c: Remove.
GCC now does

.L2:
        vmovaps b(%rax), %xmm6
        vmulps  a(%rax), %xmm6, %xmm0
        addq    $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps  %xmm0, %xmm5, %xmm5
        vmulps  a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps  %xmm0, %xmm1, %xmm1
        vmulps  a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps  %xmm0, %xmm4, %xmm4
        vmulps  a-32(%rax), %xmm7, %xmm0
        vaddps  %xmm0, %xmm2, %xmm2
        vmulps  a-16(%rax), %xmm6, %xmm0
        vaddps  %xmm0, %xmm3, %xmm3
        cmpq    $128000, %rax
        jne     .L2

and thus uses a VF of 4 with -Ofast.  Your LLVM snippet uses 5 lanes instead
of our 20: four lanes in a V4SF plus one scalar lane.  That's interesting,
but not something we support.  Re-rolling would mean using a single 4-lane
v4sf vector here.  For a pure SLP loop something like this should be possible
without too much hassle, I think.  We'd just need to try ... (and consider
whether it's worth it in real life).

For the purpose of the Summary, this is fixed now.
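An editorial sketch of what the VF-4 code above computes: five groups of four lanes per 20-element step, each group with its own 4-wide partial-sum accumulator (modeled here as a float[4]); re-rolling would collapse the five accumulators into one updated five times per step. Plain-C illustration only, with a made-up function name; n is assumed to be a multiple of 20.

```c
#include <assert.h>

/* Plain-C model of GCC's VF-4 strategy: acc[v] plays the role of one
   xmm accumulator, acc[v][l] one of its four lanes. */
static float dot_vf4 (const float *a, const float *b, int n)
{
    float acc[5][4] = {{0}};            /* five v4sf accumulators */
    for (int i = 0; i < n; i += 20)
        for (int v = 0; v < 5; v++)     /* one v4sf mul + add per group */
            for (int l = 0; l < 4; l++)
                acc[v][l] += a[i + 4*v + l] * b[i + 4*v + l];
    float dot = 0.f;                    /* horizontal reduction epilogue */
    for (int v = 0; v < 5; v++)
        for (int l = 0; l < 4; l++)
            dot += acc[v][l];
    return dot;
}
```

Re-rolling would reduce this to a single acc[4], i.e. one vector accumulator regardless of the source-level unroll factor.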