87505 – Vectorizer generates a lot of code for a small loop

Bug 87505 - Vectorizer generates a lot of code for a small loop

Summary: Vectorizer generates a lot of code for a small loop

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	unknown

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-10-03 16:19 UTC by AK
Modified:	2018-10-04 10:05 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description AK 2018-10-03 16:19:48 UTC

test.cpp

#include <cstdint>

int bar(int* v, std::size_t base) {
    int sum = 0;
    for (int i = base; i < base + 4; ++i) {
        sum += v[i];
    }
    return sum;
}


$ gcc-8.2 -std=c++17 -O3 -DNDEBUG test.cpp

bar(int*, unsigned long):
        movslq  %esi, %rcx
        leaq    4(%rsi), %r8
        movl    %esi, %edx
        cmpq    %r8, %rcx
        jnb     .L7
        leaq    3(%rsi), %rax
        movq    %r8, %r9
        subq    %rcx, %rax
        subq    %rcx, %r9
        cmpq    $3, %rax
        jbe     .L8
        movq    %r9, %rdx
        leaq    (%rdi,%rcx,4), %rax
        pxor    %xmm0, %xmm0
        shrq    $2, %rdx
        salq    $4, %rdx
        addq    %rax, %rdx
.L5:
        movdqu  (%rax), %xmm2
        addq    $16, %rax
        paddd   %xmm2, %xmm0
        cmpq    %rdx, %rax
        jne     .L5
        movdqa  %xmm0, %xmm1
        movq    %r9, %r10
        psrldq  $8, %xmm1
        andq    $-4, %r10
        paddd   %xmm1, %xmm0
        addq    %r10, %rcx
        leal    (%rsi,%r10), %edx
        movdqa  %xmm0, %xmm1
        psrldq  $4, %xmm1
        paddd   %xmm1, %xmm0
        movd    %xmm0, %eax
        cmpq    %r10, %r9
        je      .L10
.L3:
        addl    (%rdi,%rcx,4), %eax
        leal    1(%rdx), %ecx
        movslq  %ecx, %rcx
        cmpq    %r8, %rcx
        jnb     .L1
        addl    (%rdi,%rcx,4), %eax
        leal    2(%rdx), %ecx
        movslq  %ecx, %rcx
        cmpq    %rcx, %r8
        jbe     .L1
        addl    $3, %edx
        addl    (%rdi,%rcx,4), %eax
        movslq  %edx, %rdx
        cmpq    %rdx, %r8
        jbe     .L1
        addl    (%rdi,%rdx,4), %eax
        ret
.L7:
        xorl    %eax, %eax
.L1:
        ret
.L10:
        ret
.L8:
        xorl    %eax, %eax
        jmp     .L3

Comment 1 Alexander Monakov 2018-10-03 16:50:44 UTC

This is because 'i' is int while 'base' is size_t. Loop init expression 'int i = base' truncates base to 32 bits, and loop condition first converts i to size_t and then compares in that unsigned type. It's not exactly obvious how many iterations this loop has :)

Using the proper type for 'i' results in reasonable code.

Perhaps in principle niter analysis could handle this anyway?

Comment 2 Richard Biener 2018-10-04 10:04:58 UTC

Hmm, we compute the loop iterates

((long unsigned int) base_9(D) - (long unsigned int) (int) base_9(D)) + 3

times.  You can probably spot the cases of INT_MAX < base < UINT_MAX
not iterating at all (i is sign-extended to std::size_t for the comparison)
and base > UINT_MAX where it iterates quite a lot (eventually).

We have to account for these cases.  If you make the suggested adjustment
then of course we know the loop always iterates 4 times.

Unless I missed something of course.

Comment 3 Richard Biener 2018-10-04 10:05:59 UTC

Both making i std::size_t or casting (base + 4) to int "fixes" this.