110692 – epilogues for loop which can be also vectorized with half size can be improved.

Bug 110692 - epilogues for loop which can be also vectorized with half size can be improved.

Summary: epilogues for loop which can be also vectorized with half size can be improved.

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	13.1.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer
	Show dependency tree / graph

Reported:	2023-07-16 19:31 UTC by Jan Hubicka
Modified:	2023-07-17 07:45 UTC (History)
CC List:	1 user (show)

See Also:	110649
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jan Hubicka 2023-07-16 19:31:27 UTC

for:
int a[99];
test()
{
        for (int i = 0 ; i < 99; i++)
                a[i]++;
}
we produce:
   0:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 8 <test+0x8>
   7:   00 
   8:   b8 00 00 00 00          mov    $0x0,%eax
   d:   ba 00 00 00 00          mov    $0x0,%edx
  12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  18:   66 0f 6f 00             movdqa (%rax),%xmm0
  1c:   48 83 c0 10             add    $0x10,%rax
  20:   66 0f fe c1             paddd  %xmm1,%xmm0
  24:   0f 29 40 f0             movaps %xmm0,-0x10(%rax)
  28:   48 39 c2                cmp    %rax,%rdx
  2b:   75 eb                   jne    18 <test+0x18>
  2d:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 35 <test+0x35>
  34:   00 
  35:   83 05 00 00 00 00 01    addl   $0x1,0x0(%rip)        # 3c <test+0x3c>
  3c:   f3 0f 7e 0d 00 00 00    movq   0x0(%rip),%xmm1        # 44 <test+0x44>
  43:   00 
  44:   66 0f fe c1             paddd  %xmm1,%xmm0
  48:   66 0f d6 05 00 00 00    movq   %xmm0,0x0(%rip)        # 50 <test+0x50>
  4f:   00 
  50:   c3                      ret

which does the 4x vectorized loop followed by 2x vectorized loopless epilogue and copy of remaining byte.

When bound is unknow we do:
int a[99];
test(int n)
{
        for (int i = 0 ; i < n; i++)
                a[i]++;
}
   0:   85 ff                   test   %edi,%edi
   2:   7e 70                   jle    74 <test+0x74>
   4:   8d 47 ff                lea    -0x1(%rdi),%eax
   7:   83 f8 02                cmp    $0x2,%eax
   a:   76 6d                   jbe    79 <test+0x79>
   c:   89 fa                   mov    %edi,%edx
   e:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 16 <test+0x16>
  15:   00
  16:   31 c0                   xor    %eax,%eax
  18:   c1 ea 02                shr    $0x2,%edx
  1b:   48 c1 e2 04             shl    $0x4,%rdx
  1f:   66 0f 6f 80 00 00 00    movdqa 0x0(%rax),%xmm0
  26:   00
  27:   48 83 c0 10             add    $0x10,%rax
  2b:   66 0f fe c1             paddd  %xmm1,%xmm0
  2f:   0f 29 80 00 00 00 00    movaps %xmm0,0x0(%rax)
  36:   48 39 d0                cmp    %rdx,%rax
  39:   75 e4                   jne    1f <test+0x1f>
  3b:   89 f8                   mov    %edi,%eax
  3d:   83 e0 fc                and    $0xfffffffc,%eax
  40:   40 f6 c7 03             test   $0x3,%dil
  44:   74 32                   je     78 <test+0x78>
  46:   48 63 d0                movslq %eax,%rdx
  49:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  50:   01
  51:   8d 50 01                lea    0x1(%rax),%edx
  54:   39 d7                   cmp    %edx,%edi
  56:   7e 1c                   jle    74 <test+0x74>
  58:   48 63 d2                movslq %edx,%rdx
  5b:   83 c0 02                add    $0x2,%eax
  5e:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  65:   01
  66:   39 c7                   cmp    %eax,%edi
  68:   7e 0a                   jle    74 <test+0x74>
  6a:   48 98                   cltq
  6c:   83 04 85 00 00 00 00    addl   $0x1,0x0(,%rax,4)
  73:   01
  74:   c3                      ret
  75:   0f 1f 00                nopl   (%rax)
  78:   c3                      ret
  79:   31 c0                   xor    %eax,%eax
  7b:   eb c9                   jmp    46 <test+0x46>

the profitability threshold is 4.
Producing the loopless epilogue just as in the first case with additional tests for block size would work better. The code looks quite bad for small trip counts since there is extra jump down to 79.

Comment 1 Richard Biener 2023-07-17 07:45:23 UTC

With the variable bound we have

a[i_11] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note:  operating on full vectors for epilogue loop.
t.c:4:28: note:  cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
a[i_11] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:4:28: note:  Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 44
  Scalar iteration cost: 28
  Scalar outside cost: 32
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 1
t.c:4:28: note:    Runtime profitability threshold = 2
t.c:4:28: note:    Static estimate profitability threshold = 4
t.c:4:28: missed:  not vectorized: estimated iteration count too small.

with the static bound:

a[i_10] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note:  operating on full vectors for epilogue loop.
a[i_10] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
t.c:4:28: note:  Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 32
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
t.c:4:28: note:    Runtime profitability threshold = 2
t.c:4:28: note:    Static estimate profitability threshold = 2

so it's the branch of the epilog of the epilog that ups the cost, not sure
whether in a reasonable way for this case.  In the end I think if the
count is unknown a fully peeled epilog with 2 iterations is a reasonable
implementation.

We could decide that costing the epilogue vectorization isn't worthwhile
but note we are not implementing all 8 byte vector ops with the same
efficiency as the 16 byte ones.