Bug 115777 - [12/13/14/15/16 regression] Severe performance regression on insertion sort at -O2 or above
Summary: [12/13/14/15/16 regression] Severe performance regression on insertion sort at -O2 or above
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 14.1.1
Importance: P2 normal
Target Milestone: 12.5
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Duplicates: 117717
Depends on:
Blocks: vectorizer
 
Reported: 2024-07-03 20:56 UTC by nrk
Modified: 2025-04-17 12:32 UTC
CC List: 5 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-07-03 00:00:00


Description nrk 2024-07-03 20:56:20 UTC
When compiled with -O2 or above, the following benchmark shows a severe performance regression.

	$ gcc -o test testcase.c -march=x86-64-v3 -O0 && ./test
	insertion_sort  =>      946
	$ gcc -o test testcase.c -march=x86-64-v3 -O1 && ./test
	insertion_sort  =>      552
	$ gcc -o test testcase.c -march=x86-64-v3 -O2 && ./test
	insertion_sort  =>     1536
	$ gcc -o test testcase.c -march=x86-64-v3 -O3 && ./test
	insertion_sort  =>     1525


Test & benchmark code below:


	#include <stddef.h>
	#include <stdint.h>
	
	#define T uint32_t
	#define SWAP(A, B) do { T tmp = A; A = B; B = tmp; } while (0)
	
	/* noinline: keep the sort out of line so the compiler cannot fold
	   the benchmark into main */
	__attribute((noinline)) static void
	insertion_sort(T *v, ptrdiff_t n)
	{
		for (ptrdiff_t i = 1; i < n; ++i) {
			for (ptrdiff_t k = i; k > 0 && v[k-1] > v[k]; --k) {
				SWAP(v[k-1], v[k]);
			}
		}
	}
	
	/// benchmark
	#include <stdio.h>
	#include <time.h>
	
	int main(void)
	{
		enum {L=1<<10}; T v[L];
	uint64_t rng = 55555;
	/* fill the array with pseudo-random values from a simple LCG */
	for (int i = 0; i < L; ++i) {
		v[i] = (rng = rng * 1111111111111111111 + 0x1337);
		}
		clock_t beg = clock();
		insertion_sort(v, L);
		clock_t end = clock();
		printf("insertion_sort  => %8ld\n", (long)(end - beg));
	}
Comment 1 Sam James 2024-07-03 21:38:32 UTC
Generated code, GCC 11 vs. GCC 12:
```
--- /dev/fd/63	2024-07-03 22:38:09.929726424 +0100
+++ /dev/fd/62	2024-07-03 22:38:09.932726456 +0100
@@ -5,24 +5,28 @@
 insertion_sort.constprop.0:
 .LFB25:
 	.cfi_startproc
-	leaq	4(%rdi), %r8
+	movq	%rdi, %r8
 	movl	$1, %esi
 	.p2align 4,,10
 	.p2align 3
 .L2:
-	movq	%r8, %rax
+	movq	%r8, %rdx
+	jmp	.L3
 	.p2align 4,,10
 	.p2align 3
+.L5:
+	vmovq	%xmm1, (%rdx)
+	leaq	-4(%rdx), %rax
+	cmpq	%rdi, %rdx
+	je	.L6
+	movq	%rax, %rdx
 .L3:
-	movl	-4(%rax), %edx
-	movl	(%rax), %ecx
-	cmpl	%edx, %ecx
-	jnb	.L6
-	movl	%ecx, -4(%rax)
-	subq	$4, %rax
-	movl	%edx, 4(%rax)
-	cmpq	%rax, %rdi
-	jne	.L3
+	vmovq	(%rdx), %xmm0
+	vmovd	%xmm0, %ecx
+	vpextrd	$1, %xmm0, %eax
+	vpshufd	$225, %xmm0, %xmm1
+	cmpl	%ecx, %eax
+	jb	.L5
 .L6:
 	addq	$1, %rsi
 	addq	$4, %r8
@@ -77,7 +81,7 @@
 	vzeroupper
 	call	clock@PLT
 	leaq	.LC0(%rip), %rsi
-	movl	$1, %edi
+	movl	$2, %edi
 	subq	%rbx, %rax
 	movq	%rax, %rdx
 	xorl	%eax, %eax
```
Comment 2 Andrew Pinski 2024-07-03 23:12:14 UTC
Looks like a cost model issue:
```
/app/example.cpp:13:5: note: Cost model analysis for part in loop 2:
  Vector cost: 44
  Scalar cost: 48
/app/example.cpp:13:5: note: Basic block will be vectorized using SLP
```


On aarch64:
```
/app/example.cpp:13:5: note: Cost model analysis for part in loop 2:
  Vector cost: 12
  Scalar cost: 4
/app/example.cpp:13:5: missed: not vectorized: vectorization is not profitable.

```
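For reference, diagnostics of this form come from the vectorizer's opt-info output; a command along these lines should reproduce them (the exact flags used for the dump above are an assumption, not taken from this report):
```
gcc -O2 -march=x86-64-v3 -fopt-info-vec-all testcase.c
```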
Comment 3 Richard Biener 2024-07-04 07:21:22 UTC
It's the usual issue of a very high cost for the (scalar) loads and stores and, in this case, a very low cost for the vector extract - which we even over-cost by a factor of
two because of the "duplicate" live stmt:

t.c:12:4: note: Costing subgraph:
t.c:12:4: note: node 0x51657c0 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: *_2 = _1;
t.c:12:4: note:         stmt 0 *_2 = _1;
t.c:12:4: note:         stmt 1 *_4 = _3;
t.c:12:4: note:         children 0x51658e0
t.c:12:4: note: node 0x51658e0 (max_nunits=1, refcnt=2) vector(2) unsigned int
t.c:12:4: note: op: VEC_PERM_EXPR
t.c:12:4: note:         [l] stmt 0 _1 = *_4;
t.c:12:4: note:         [l] stmt 1 _3 = *_2;
t.c:12:4: note:         lane permutation { 0[1] 0[0] }
t.c:12:4: note:         children 0x5165850 
t.c:12:4: note: node 0x5165850 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: _1 = *_4;
t.c:12:4: note:         [l] stmt 0 _3 = *_2;
t.c:12:4: note:         [l] stmt 1 _1 = *_4;

Too bad that we end up using the lane extracted after the permute (that
gets us higher latency).

As with other BB vectorization opportunities it's difficult to estimate
the cost of tying multiple independent data dependencies into a single
vector one and weigh that against out-of-order independent execution.

To sum up, on the vectorizer side there's a bug that costs the vector side
too much for the lane extracts (which makes the target cost issue even
more pronounced once fixed).

There are two problems with the vector code:

.L7:
        subq    $4, %rax
.L3:
        vmovq   (%rax), %xmm0
        vmovd   %xmm0, %edx
        vpextrd $1, %xmm0, %ecx
        cmpl    %edx, %ecx
        jnb     .L6
        vpshufd $225, %xmm0, %xmm0
        vmovq   %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

On zen4 the moves from vector to GPR are expensive.  But the most apparent
issue is that there are store-to-load forwarding conflicts: we store 8 bytes
to (%rax) and immediately load 8 bytes from 4(%rax) in the next iteration.
That's something the BB vectorizer does not consider
at all (in theory it could anticipate this by looking at the data-ref evolution).
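To make the hazard concrete, here is a minimal scalar C sketch of the memory traffic of the vectorized inner loop (load8/store8 are hypothetical helpers standing in for the vmovq accesses; this illustrates the access pattern, it is not the vectorizer's actual output):
```
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers modelling the 8-byte vmovq load/store.  */
static inline uint64_t load8(const uint32_t *p) { uint64_t x; memcpy(&x, p, 8); return x; }
static inline void store8(uint32_t *p, uint64_t x) { memcpy(p, &x, 8); }

static void inner_loop_shape(uint32_t *v, ptrdiff_t i)
{
	for (ptrdiff_t k = i; k > 0; --k) {
		uint64_t pair = load8(&v[k-1]);       /* {v[k-1], v[k]} in one load */
		uint32_t lo = (uint32_t)pair;         /* v[k-1] (little endian)     */
		uint32_t hi = (uint32_t)(pair >> 32); /* v[k]                       */
		if (hi >= lo)                         /* already ordered: stop      */
			break;
		store8(&v[k-1], ((uint64_t)lo << 32) | hi);  /* store swapped pair  */
		/* The next iteration's 8-byte load at &v[k-2] only partially
		   overlaps this 8-byte store at &v[k-1], so the store cannot be
		   forwarded to the load - an STLF stall on every swap.  */
	}
}
```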

So trying to attack this from the cost modeling side is unlikely to cover
this aspect of the issue.

Placing

#pragma GCC novector

before the inner loop is a workaround that works in GCC 14 and up.
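For reference, applied to the testcase above this looks like:
```
	for (ptrdiff_t i = 1; i < n; ++i) {
		#pragma GCC novector
		for (ptrdiff_t k = i; k > 0 && v[k-1] > v[k]; --k) {
			SWAP(v[k-1], v[k]);
		}
	}
```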
Comment 4 Richard Biener 2024-10-13 13:33:39 UTC
Mine.
Comment 5 Andrew Pinski 2024-11-21 07:44:05 UTC
*** Bug 117717 has been marked as a duplicate of this bug. ***
Comment 6 Richard Biener 2024-11-21 07:56:20 UTC
I hope to get to this during stage3/4. I think it should be relatively easy to detect; I'm just not sure yet where to place the actual mitigation.
Comment 7 Richard Biener 2025-01-09 10:38:17 UTC
So I tried the optimistic way of classifying a problematic load as VMAT_ELEMENTWISE, which for BB vectorization results in not vectorizing the SLP node but instead making it external, building it from scalars.  That still makes vectorization profitable:

_7 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_6 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
node 0x3f1bf0b0 1 times vec_perm costs 4 in body
node 0x3f1bf020 1 times vec_construct costs 4 in prologue
_7 1 times unaligned_store (misalign -1) costs 12 in body
*_6 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:7:11: note: Cost model analysis for part in loop 2:
  Vector cost: 28
  Scalar cost: 48
t.c:7:11: note: Basic block will be vectorized using SLP

I think we falsely consider the permute node recording the corresponding
scalar lanes as covering the scalar loads here, not realizing we have to
keep them (and on the other side we think we have to extract both
lanes from the permute).  Fixing the first issue would reduce the scalar
cost by 24, fixing both would reduce the vector cost by 8 - in the end still
trading a scalar store (12) for vector construction and permute (8).
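Spelling that arithmetic out against the cost dump above (my reading of the numbers, not part of the original report):
```
scalar cost: 48 - 24 (keep the two scalar_loads)           = 24
vector cost: 28 -  8 (drop the two vec_to_scalar extracts) = 20
```
so the comparison reduces to the two scalar stores (24) versus one vector store plus vec_construct and vec_perm (12 + 4 + 4 = 20), and vectorization still wins by 4.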

The result is

insertion_sort  =>     1008

which is faster than with the STLF failures

insertion_sort  =>     2333

but slower than w/o vectorization

insertion_sort  =>      181

        movl    (%rax), %ecx
        movl    4(%rax), %edx
        cmpl    %ecx, %edx
        jnb     .L6
        movd    %edx, %xmm0
        movd    %ecx, %xmm1
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

in backend costing we do anticipate the vector construction to happen
by loading from memory though, so we don't account for the extra
GPR->xmm move penalty.
Comment 8 Hongtao Liu 2025-01-09 14:58:49 UTC
> in backend costing we do anticipate the vector construction to happen
> by loading from memory though, so we don't account for the extra
> GPR->xmm move penalty.
Yes, I saw something similar before and had a patch: for an ssa_name with multiple uses, there should be a GPR->XMM move even if it's from memory.
Comment 9 Richard Biener 2025-01-09 15:02:38 UTC
(In reply to Hongtao Liu from comment #8)
> > in backend costing we do anticipate the vector construction to happen
> > by loading from memory though, so we don't account for the extra
> > GPR->xmm move penalty.
> Yes, I saw something similar before and had a patch: for an ssa_name with
> multiple uses, there should be a GPR->XMM move even if it's from memory.

That's probably the conservative answer for BB vectorization; for loop vect
we know all those uses will also be in vector code.  For BB vectorization
there is currently no easy, reliable check to ensure this.
Comment 10 Hongtao Liu 2025-01-10 01:50:01 UTC
> That's probably the conservative answer for BB vectorization; for loop vect
> we know all those uses will also be in vector code.  For BB vectorization
> there is currently no easy, reliable check to ensure this.

Oh, I see.
Comment 11 GCC Commits 2025-01-15 07:20:01 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:21edcb95340424efe2e66831f3b32cbab9cc6d31

commit r15-6905-g21edcb95340424efe2e66831f3b32cbab9cc6d31
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jan 9 11:51:19 2025 +0100

    Fix SLP scalar costing with stmts also used in externals
    
    When we have the situation of an external SLP node that is
    permuted the scalar stmts recorded in the permute node do not
    mean the scalar computation can be removed.  We are removing
    those stmts from the vectorized_scalar_stmts for this reason
    but we fail to check this set when we cost scalar stmts.  Note
    vectorized_scalar_stmts isn't a complete set so also pass
    scalar_stmts_in_externs and check that.
    
    The following fixes this.
    
    This shows in PR115777 when we avoid vectorizing the load, but
    on its own doesn't help the PR yet.
    
            PR tree-optimization/115777
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Do not
            cost a scalar stmt that needs to be preserved.