Consider the example: struct array { int data[5]; }; array test(int i) { array a = {1, i, 2, 3, 4}; for (int j = 0; j < 5; ++j) { a.data[j] += j; } return a; } GCC-9 generates ~20 instructions with jmps. Rewriting the same function with unrolled loop makes the assembly much better: array test2(int i) { array a = {1, i, 2, 3, 4}; a.data[0] += 0; a.data[1] += 1; a.data[2] += 2; a.data[3] += 3; a.data[4] += 4; return a; } Assembly for `test2` takes only ~8 instructions: test2(int): add esi, 1 mov DWORD PTR [rdi], 1 mov rax, rdi movabs rdx, 25769803780 mov DWORD PTR [rdi+4], esi mov QWORD PTR [rdi+8], rdx mov DWORD PTR [rdi+16], 8 ret
With -O3 I even see this vectorized to _Z4testi: .LFB0: .cfi_startproc movabsq $12884901890, %rdx movl $1, (%rdi) movq %rdi, %rax movq %rdx, 8(%rdi) movl %esi, 4(%rdi) movdqu (%rdi), %xmm0 paddd .LC0(%rip), %xmm0 movl $8, 16(%rdi) movups %xmm0, (%rdi) ret but no jumps so you must use plain -O2? Here unrolling is only done if we estimate the code to not grow but the estimate is Loop size: 7 Estimated size after unrolling: 9 Not unrolling loop 1: size would grow. so you have to specify -funroll-loops where we get the desired _Z4testi: .LFB0: .cfi_startproc addl $1, %esi movl $1, (%rdi) movq %rdi, %rax movabsq $25769803780, %rdx movl %esi, 4(%rdi) movq %rdx, 8(%rdi) movl $8, 16(%rdi) ret then. For the unrolling heuristic it's hard to see the (partly) constant initializer.
The estimation is very close to the actual result for the loop. But it does not take into the account the instructions before the loop that are eliminated due to unrolling. Some heuristic like "initializing the local variable with goes away for unrolled loops if the variable is rewritten in loop or if the variable is not used outside the loop"
On Thu, 4 Apr 2019, antoshkka at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89922 > > --- Comment #2 from Antony Polukhin <antoshkka at gmail dot com> --- > The estimation is very close to the actual result for the loop. > > But it does not take into the account the instructions before the loop that are > eliminated due to unrolling. Some heuristic like "initializing the local > variable with goes away for unrolled loops if the variable is rewritten in loop > or if the variable is not used outside the loop" Sure, but you'd need to collect some function-scope information to feed such heuristics to not fall on the wrong-side if those constraints are not met. Was the testcase just an artificial one or does it appear (in this isolated form!) in a real application/benchmark?
> Was the testcase just an artificial one or does it appear (in this > isolated form!) in a real application/benchmark? I was not investigating a particular benchmark or real world application at first. My guess is that heuristic will affect cryptography (initializing big arrays with magic constants) and math (matrix multiplication with identity matrix for example). I've tried to check the validity of the guess. The very first attempt succeeded. Hash computation for a constant string is not well optimized: https://godbolt.org/z/iKi0pb The heuristic may notice that the string is a local variable and may force the loop unrolling. Hash computations on a constant variable is a common case in libstdc++ when working with unordered maps and sets. There's definitely some room for improvement for cases when a local variable is used in the loop only.
the testcase in https://godbolt.org/z/iKi0pb is well optimized in gcc6.5 with O3, but not gcc7 and later. I have checked the gimple code dumped by optimized pass which are same. The difference is done by rtl_cse1 pass.
Estimating sizes for loop 1 BB: 3, after_exit: 0 size: 1 _1 = <retval>.data[j_16]; size: 1 _2 = _1 + j_16; size: 1 <retval>.data[j_16] = _2; size: 1 j_13 = j_16 + 1; Induction variable computation will be folded away. size: 1 ivtmp_4 = ivtmp_18 - 1; Induction variable computation will be folded away. size: 2 if (ivtmp_4 != 0) Exit condition will be eliminated in peeled copies. Exit condition will be eliminated in last copy. Constant conditional. BB: 5, after_exit: 1 size: 7-4, last_iteration: 7-4 Loop size: 7 Estimated size after unrolling: 10 Not unrolling loop 1: size would grow.
Dup of bug 85747. *** This bug has been marked as a duplicate of bug 85747 ***