89922 – Loop on fixed size array is not unrolled and poorly optimized at -O2

Bug 89922 - Loop on fixed size array is not unrolled and poorly optimized at -O2

Summary: Loop on fixed size array is not unrolled and poorly optimized at -O2

Status:	RESOLVED DUPLICATE of bug 85747

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	9.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2019-04-02 10:49 UTC by Antony Polukhin
Modified:	2021-12-27 04:11 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-04-03 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Antony Polukhin 2019-04-02 10:49:14 UTC

Consider the example:


struct array {
   int data[5];
};

array test(int i) {
    array a = {1, i, 2, 3, 4};
    
    for (int j = 0; j < 5; ++j) {
      a.data[j] += j;
    }
    
    return a;
}


GCC-9 generates ~20 instructions with jmps.

Rewriting the same function with unrolled loop makes the assembly much better:

array test2(int i) {
    array a = {1, i, 2, 3, 4};
    a.data[0] += 0;
    a.data[1] += 1;
    a.data[2] += 2;
    a.data[3] += 3;
    a.data[4] += 4;
    
    return a;
}


Assembly for `test2` takes only ~8 instructions:
test2(int):
        add     esi, 1
        mov     DWORD PTR [rdi], 1
        mov     rax, rdi
        movabs  rdx, 25769803780
        mov     DWORD PTR [rdi+4], esi
        mov     QWORD PTR [rdi+8], rdx
        mov     DWORD PTR [rdi+16], 8
        ret

Comment 1 Richard Biener 2019-04-03 09:13:40 UTC

With -O3 I even see this vectorized to

_Z4testi:
.LFB0:
        .cfi_startproc
        movabsq $12884901890, %rdx
        movl    $1, (%rdi)
        movq    %rdi, %rax
        movq    %rdx, 8(%rdi)
        movl    %esi, 4(%rdi)
        movdqu  (%rdi), %xmm0
        paddd   .LC0(%rip), %xmm0
        movl    $8, 16(%rdi)
        movups  %xmm0, (%rdi)
        ret

but no jumps so you must use plain -O2?  Here unrolling is only done
if we estimate the code to not grow but the estimate is

  Loop size: 7
  Estimated size after unrolling: 9
Not unrolling loop 1: size would grow.

so you have to specify -funroll-loops where we get the desired

_Z4testi:
.LFB0:
        .cfi_startproc
        addl    $1, %esi
        movl    $1, (%rdi)
        movq    %rdi, %rax
        movabsq $25769803780, %rdx
        movl    %esi, 4(%rdi)
        movq    %rdx, 8(%rdi)
        movl    $8, 16(%rdi)
        ret

then.  For the unrolling heuristic it's hard to see the (partly)
constant initializer.

Comment 2 Antony Polukhin 2019-04-04 14:17:46 UTC

The estimation is very close to the actual result for the loop.

But it does not take into the account the instructions before the loop that are eliminated due to unrolling. Some heuristic like "initializing the local variable with goes away for unrolled loops if the variable is rewritten in loop or if the variable is not used outside the loop"

Comment 3 rguenther@suse.de 2019-04-05 07:47:45 UTC

On Thu, 4 Apr 2019, antoshkka at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89922
> 
> --- Comment #2 from Antony Polukhin <antoshkka at gmail dot com> ---
> The estimation is very close to the actual result for the loop.
> 
> But it does not take into the account the instructions before the loop that are
> eliminated due to unrolling. Some heuristic like "initializing the local
> variable with goes away for unrolled loops if the variable is rewritten in loop
> or if the variable is not used outside the loop"

Sure, but you'd need to collect some function-scope information to feed
such heuristics to not fall on the wrong-side if those constraints are
not met.

Was the testcase just an artificial one or does it appear (in this
isolated form!) in a real application/benchmark?

Comment 4 Antony Polukhin 2019-04-05 16:56:22 UTC

> Was the testcase just an artificial one or does it appear (in this
> isolated form!) in a real application/benchmark?

I was not investigating a particular benchmark or real world application at first.

My guess is that heuristic will affect cryptography (initializing big arrays with magic constants) and math (matrix multiplication with identity matrix for example).

I've tried to check the validity of the guess. The very first attempt succeeded. Hash computation for a constant string is not well optimized: https://godbolt.org/z/iKi0pb The heuristic may notice that the string is a local variable and may force the loop unrolling. Hash computations on a constant variable is a common case in libstdc++ when working with unordered maps and sets.

There's definitely some room for improvement for cases when a local variable is used in the loop only.

Comment 5 JunMa 2019-04-10 06:23:03 UTC

the testcase in https://godbolt.org/z/iKi0pb is well optimized in gcc6.5 with O3, but not gcc7 and later. 
I have checked the gimple code dumped by optimized pass which are same.
The difference is done by rtl_cse1 pass.

Comment 6 Andrew Pinski 2021-12-22 09:35:25 UTC

Estimating sizes for loop 1
 BB: 3, after_exit: 0
  size:   1 _1 = <retval>.data[j_16];
  size:   1 _2 = _1 + j_16;
  size:   1 <retval>.data[j_16] = _2;
  size:   1 j_13 = j_16 + 1;
   Induction variable computation will be folded away.
  size:   1 ivtmp_4 = ivtmp_18 - 1;
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_4 != 0)
   Exit condition will be eliminated in peeled copies.
   Exit condition will be eliminated in last copy.
   Constant conditional.
 BB: 5, after_exit: 1
size: 7-4, last_iteration: 7-4
  Loop size: 7
  Estimated size after unrolling: 10
Not unrolling loop 1: size would grow.

Comment 7 Andrew Pinski 2021-12-27 04:11:54 UTC

Dup of bug 85747.

*** This bug has been marked as a duplicate of bug 85747 ***