When compiled with -O2 or above, the following benchmark shows a severe performance regression.

```
$ gcc -o test testcase.c -march=x86-64-v3 -O0 && ./test
insertion_sort =>      946
$ gcc -o test testcase.c -march=x86-64-v3 -O1 && ./test
insertion_sort =>      552
$ gcc -o test testcase.c -march=x86-64-v3 -O2 && ./test
insertion_sort =>     1536
$ gcc -o test testcase.c -march=x86-64-v3 -O3 && ./test
insertion_sort =>     1525
```

Test & benchmark code below:

```
#include <stddef.h>
#include <stdint.h>

#define T uint32_t
#define SWAP(A, B) do { T tmp = A; A = B; B = tmp; } while (0)

__attribute((noinline))
static void insertion_sort(T *v, ptrdiff_t n)
{
    for (ptrdiff_t i = 1; i < n; ++i) {
        for (ptrdiff_t k = i; k > 0 && v[k-1] > v[k]; --k) {
            SWAP(v[k-1], v[k]);
        }
    }
}

/// benchmark
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum {L=1<<10};
    T v[L];
    uint64_t rng = 55555;
    for (int i = 0; i < L; ++i) {
        v[i] = (rng = rng * 1111111111111111111 + 0x1337);
    }
    clock_t beg = clock();
    insertion_sort(v, L);
    clock_t end = clock();
    printf("insertion_sort => %8ld\n", (long)(end - beg));
}
```
GCC 11 vs GCC 12 assembly:

```
--- /dev/fd/63	2024-07-03 22:38:09.929726424 +0100
+++ /dev/fd/62	2024-07-03 22:38:09.932726456 +0100
@@ -5,24 +5,28 @@
 insertion_sort.constprop.0:
 .LFB25:
 	.cfi_startproc
-	leaq	4(%rdi), %r8
+	movq	%rdi, %r8
 	movl	$1, %esi
 	.p2align 4,,10
 	.p2align 3
 .L2:
-	movq	%r8, %rax
+	movq	%r8, %rdx
+	jmp	.L3
 	.p2align 4,,10
 	.p2align 3
+.L5:
+	vmovq	%xmm1, (%rdx)
+	leaq	-4(%rdx), %rax
+	cmpq	%rdi, %rdx
+	je	.L6
+	movq	%rax, %rdx
 .L3:
-	movl	-4(%rax), %edx
-	movl	(%rax), %ecx
-	cmpl	%edx, %ecx
-	jnb	.L6
-	movl	%ecx, -4(%rax)
-	subq	$4, %rax
-	movl	%edx, 4(%rax)
-	cmpq	%rax, %rdi
-	jne	.L3
+	vmovq	(%rdx), %xmm0
+	vmovd	%xmm0, %ecx
+	vpextrd	$1, %xmm0, %eax
+	vpshufd	$225, %xmm0, %xmm1
+	cmpl	%ecx, %eax
+	jb	.L5
 .L6:
 	addq	$1, %rsi
 	addq	$4, %r8
@@ -77,7 +81,7 @@
 	vzeroupper
 	call	clock@PLT
 	leaq	.LC0(%rip), %rsi
-	movl	$1, %edi
+	movl	$2, %edi
 	subq	%rbx, %rax
 	movq	%rax, %rdx
 	xorl	%eax, %eax
```
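For readability, a rough C-level model of what the GCC 12 inner-loop body above does per compare/swap step (illustration only; swap_pair_if_unordered is a hypothetical helper, not something GCC emits):

```
#include <stdint.h>
#include <string.h>

/* Sketch of the SLP-vectorized step: one 8-byte load of v[k-1..k],
   both lanes extracted for the compare, and on a swap a single 8-byte
   store of the shuffled pair.  The next inner-loop iteration repeats
   this at p - 1, so the 8-byte store and the following 8-byte load
   overlap by 4 bytes. */
static int swap_pair_if_unordered(uint32_t *p)   /* p == &v[k-1] */
{
    uint64_t pair;
    memcpy(&pair, p, 8);                     /* vmovq   (%rdx), %xmm0   */
    uint32_t lo = (uint32_t)pair;            /* vmovd   %xmm0, %ecx     */
    uint32_t hi = (uint32_t)(pair >> 32);    /* vpextrd $1, %xmm0, %eax */
    if (hi >= lo)                            /* cmpl %ecx, %eax; jb .L5 */
        return 0;
    uint64_t swapped = ((uint64_t)lo << 32) | hi;  /* vpshufd $225      */
    memcpy(p, &swapped, 8);                  /* vmovq   %xmm1, (%rdx)   */
    return 1;
}
```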
Looks like a cost model issue:

```
/app/example.cpp:13:5: note: Cost model analysis for part in loop 2:
  Vector cost: 44
  Scalar cost: 48
/app/example.cpp:13:5: note: Basic block will be vectorized using SLP
```

On aarch64:

```
/app/example.cpp:13:5: note: Cost model analysis for part in loop 2:
  Vector cost: 12
  Scalar cost: 4
/app/example.cpp:13:5: missed: not vectorized: vectorization is not profitable.
```
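For anyone reproducing, these cost-model notes show up in the vectorizer opt-info output, e.g. (the exact wording of the notes differs between GCC versions):

```
$ gcc -c testcase.c -march=x86-64-v3 -O2 -fopt-info-vec-all
```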
It's the usual issue of a very high cost for (scalar) load and store and, in this case, a very low cost for the vector extract - which we even over-cost by a factor of two because of the "duplicate" live stmt:

```
t.c:12:4: note: Costing subgraph:
t.c:12:4: note: node 0x51657c0 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: *_2 = _1;
t.c:12:4: note:     stmt 0 *_2 = _1;
t.c:12:4: note:     stmt 1 *_4 = _3;
t.c:12:4: note:     children 0x51658e0
t.c:12:4: note: node 0x51658e0 (max_nunits=1, refcnt=2) vector(2) unsigned int
t.c:12:4: note: op: VEC_PERM_EXPR
t.c:12:4: note:     [l] stmt 0 _1 = *_4;
t.c:12:4: note:     [l] stmt 1 _3 = *_2;
t.c:12:4: note:     lane permutation { 0[1] 0[0] }
t.c:12:4: note:     children 0x5165850
t.c:12:4: note: node 0x5165850 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: _1 = *_4;
t.c:12:4: note:     [l] stmt 0 _3 = *_2;
t.c:12:4: note:     [l] stmt 1 _1 = *_4;
```

Too bad that we end up using the lane extracted after the permute (that gets us higher latency). As with other BB vectorization opportunities it's difficult to estimate the cost of tying multiple independent data dependencies into a single vector one and weigh that against out-of-order independent execution.

To sum up, on the vectorizer side there's a bug costing the vector side too much for the lane extracts (which makes the target cost issue even more pronounced once fixed).

There are two problems with the vector code:

```
.L7:
	subq	$4, %rax
.L3:
	vmovq	(%rax), %xmm0
	vmovd	%xmm0, %edx
	vpextrd	$1, %xmm0, %ecx
	cmpl	%edx, %ecx
	jnb	.L6
	vpshufd	$225, %xmm0, %xmm0
	vmovq	%xmm0, (%rax)
	cmpq	%rdi, %rax
	jne	.L7
```

On zen4 the moves from vector to GPR are expensive. But the most apparent issue is the store-to-load forwarding (STLF) conflict: the 8-byte store to (%rax) is immediately followed, in the next iteration, by an 8-byte load that overlaps it at a 4-byte offset. That's something the BB vectorizer does not consider at all (it could look at the data-ref evolution to anticipate this, in theory). So trying to attack this from the cost modeling side is unlikely to cover this aspect of the issue.

Placing #pragma GCC novector before the inner loop is a workaround that works in GCC 14 and up.
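For completeness, a minimal sketch of that workaround applied to the testcase above:

```
#include <stddef.h>
#include <stdint.h>

#define T uint32_t
#define SWAP(A, B) do { T tmp = A; A = B; B = tmp; } while (0)

/* Workaround: #pragma GCC novector (honored by GCC 14 and later)
   requests that the following loop not be vectorized, keeping the
   scalar compare/swap and avoiding the STLF-unfriendly 8-byte store. */
__attribute((noinline))
static void insertion_sort(T *v, ptrdiff_t n)
{
    for (ptrdiff_t i = 1; i < n; ++i) {
        #pragma GCC novector
        for (ptrdiff_t k = i; k > 0 && v[k-1] > v[k]; --k) {
            SWAP(v[k-1], v[k]);
        }
    }
}
```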
Mine.
*** Bug 117717 has been marked as a duplicate of this bug. ***
I hope to get to this during stage 3/4. I think it should be relatively easy to detect; I'm not exactly sure where to place the actual mitigation yet.
So I tried the optimistic way of classifying a problematic load as VMAT_ELEMENTWISE, which for BB vectorization results in not vectorizing the SLP node but instead making it external, building it from scalars. That still makes vectorization profitable:

```
_7 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_6 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
node 0x3f1bf0b0 1 times vec_perm costs 4 in body
node 0x3f1bf020 1 times vec_construct costs 4 in prologue
_7 1 times unaligned_store (misalign -1) costs 12 in body
*_6 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:7:11: note: Cost model analysis for part in loop 2:
  Vector cost: 28
  Scalar cost: 48
t.c:7:11: note: Basic block will be vectorized using SLP
```

I think we falsely consider the permute node recording the corresponding scalar lanes as covering the scalar loads here, not realizing we have to keep them (and on the other side we think we have to extract both lanes from the permute). Fixing the first issue would reduce the scalar cost by 24; fixing both would reduce the vector cost by 8, in the end still trading a scalar store (12) for vector construction plus permute (8).

The result is insertion_sort => 1008, which is faster than with the STLF failures (insertion_sort => 2333) but slower than without vectorization (insertion_sort => 181). The inner loop now is:

```
	movl	(%rax), %ecx
	movl	4(%rax), %edx
	cmpl	%ecx, %edx
	jnb	.L6
	movd	%edx, %xmm0
	movd	%ecx, %xmm1
	punpckldq	%xmm1, %xmm0
	movq	%xmm0, (%rax)
	cmpq	%rdi, %rax
	jne	.L7
```

In backend costing we do anticipate the vector construction to happen by loading from memory though, so we don't account for the extra GPR->xmm move penalty.
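To illustrate that last point with intrinsics (illustration only, not what the costing code looks at): if the two lanes were needed only inside the vector, the construct could be a single 8-byte load, which is what the backend costing assumes; because the scalars are also live in GPRs for the compare, the construct instead needs GPR->xmm moves plus a pack, matching the movd/movd/punpckldq sequence above.

```
#include <immintrin.h>
#include <stdint.h>

/* Construct {p[0], p[1]} when the values are not otherwise needed:
   a single 8-byte load, which is what the backend costing assumes. */
static inline __m128i pair_from_memory(const uint32_t *p)
{
    return _mm_loadl_epi64((const __m128i *)p);   /* movq (%rax), %xmm0 */
}

/* Construct the pair when both values already live in GPRs (because the
   scalar compare needs them): two GPR->xmm moves plus an interleave. */
static inline __m128i pair_from_gprs(uint32_t x, uint32_t y)
{
    __m128i a = _mm_cvtsi32_si128((int)x);        /* movd GPR -> xmm */
    __m128i b = _mm_cvtsi32_si128((int)y);        /* movd GPR -> xmm */
    return _mm_unpacklo_epi32(a, b);              /* punpckldq       */
}
```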
> in backend costing we do anticipate the vector construction to happen
> by loading from memory though, so we don't account for the extra
> GPR->xmm move penalty.

Yes, I saw something similar before and had a patch: for an ssa_name with multiple uses, there should be a GPR->XMM move even if it's from memory.
(In reply to Hongtao Liu from comment #8)
> > in backend costing we do anticipate the vector construction to happen
> > by loading from memory though, so we don't account for the extra
> > GPR->xmm move penalty.
> Yes, I saw something similar before and had a patch, for ssa_name with
> multiple uses, there should be a GPR->XMM even it's from memory.

That's probably the conservative answer for BB vectorization; for loop vect we know all those uses will also be in vector code. For BB vectorization there is currently no easy and reliable check to ensure this.
> That's probably the conservative answer for BB vectorization, for loop vect
> we know all those uses will be also in vector code. For BB vectorization
> there is currently no easly reliable check to ensure this.

Oh, I see.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:21edcb95340424efe2e66831f3b32cbab9cc6d31

commit r15-6905-g21edcb95340424efe2e66831f3b32cbab9cc6d31
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jan 9 11:51:19 2025 +0100

    Fix SLP scalar costing with stmts also used in externals

    When we have the situation of an external SLP node that is permuted
    the scalar stmts recorded in the permute node do not mean the scalar
    computation can be removed.  We are removing those stmts from the
    vectorized_scalar_stmts for this reason but we fail to check this set
    when we cost scalar stmts.  Note vectorized_scalar_stmts isn't a
    complete set so also pass scalar_stmts_in_externs and check that.

    The following fixes this.  This shows in PR115777 when we avoid
    vectorizing the load, but on its own doesn't help the PR yet.

            PR tree-optimization/115777
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Do not cost
            a scalar stmt that needs to be preserved.