Created attachment 51168 [details]
simplified example reproducing problem

Hi,

I found -ftree-loop-distribute-patterns to be far too aggressive in replacing code, leading to increased code size and substantial slowdowns (12% in the program in which I just hit this). The code size increase and slowdown are partially caused by the function call itself, and partially by the spilling necessary to make that function call. This is worsened further by the PLT call to memmove().

A very simplified example (also attached) is this:

typedef struct node
{
    unsigned char chunks[4];
    unsigned char count;
} node;

void foo(node *a, unsigned char newchunk, unsigned char off)
{
    if (a->count > 3)
        __builtin_unreachable();

    for (int i = a->count - 1; i >= off; i--)
        a->chunks[i + 1] = a->chunks[i];
    a->chunks[off] = newchunk;
}

which with `-O2 -fPIC` boils down to:

foo(node*, unsigned char, unsigned char):
        pushq   %r12
        movl    %edx, %r8d
        movl    %esi, %r12d
        pushq   %rbp
        movq    %rdi, %rbp
        pushq   %rbx
        movzbl  4(%rdi), %ecx
        movzbl  %r8b, %ebx
        leal    -1(%rcx), %edx
        cmpl    %ebx, %edx
        jl      .L2
        movl    %ecx, %eax
        movslq  %edx, %rsi
        subl    %ebx, %ecx
        subl    $1, %ecx
        movq    %rsi, %rdx
        subq    %rcx, %rdx
        leaq    1(%rcx), %r8
        leaq    (%rdi,%rdx), %rsi
        movzbl  %al, %edi
        movq    %r8, %rdx
        movq    %rdi, %rax
        subq    %rcx, %rax
        leaq    0(%rbp,%rax), %rdi
        call    memmove@PLT
.L2:
        movb    %r12b, 0(%rbp,%rbx)
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret

compare to `-O2 -fPIC -fno-tree-loop-distribute-patterns`:

foo(node*, unsigned char, unsigned char):
        movzbl  4(%rdi), %eax
        movzbl  %dl, %edx
        subl    $1, %eax
        cmpl    %edx, %eax
        jl      .L2
        cltq
.L3:
        movzbl  (%rdi,%rax), %ecx
        movb    %cl, 1(%rdi,%rax)
        subq    $1, %rax
        cmpl    %eax, %edx
        jle     .L3
.L2:
        movb    %sil, (%rdi,%rdx)
        ret

which I think makes the problem apparent.

Regards,

Andres Freund
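(Not part of the original report — a hedged workaround sketch.) Until the transform is tuned, the pattern replacement can be suppressed for just this function with GCC's `optimize` function attribute, whose string argument mirrors the command-line flag named above; the rest of the translation unit keeps its normal -O2 pipeline:

```c
#include <assert.h>

typedef struct node
{
    unsigned char chunks[4];
    unsigned char count;
} node;

/* Per-function workaround sketch: disable loop-pattern distribution so the
   short byte-copy loop is kept instead of being turned into memmove@PLT.
   The attribute string corresponds to -fno-tree-loop-distribute-patterns. */
__attribute__((optimize("no-tree-loop-distribute-patterns")))
void foo(node *a, unsigned char newchunk, unsigned char off)
{
    if (a->count > 3)
        __builtin_unreachable();

    /* Shift chunks[off..count-1] up by one, then insert the new chunk. */
    for (int i = a->count - 1; i >= off; i--)
        a->chunks[i + 1] = a->chunks[i];
    a->chunks[off] = newchunk;
}
```

The attribute only affects code generation, not semantics, so it is safe to drop again once the underlying cost model is fixed.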
Confirmed. From the ranges it should be possible to figure out that the max size is 4. In fact GCC really does know the range (note I think it is really [0, 2], but that is a different issue), and the loop bound is at most 4:

;;   basic block 6, loop depth 1, count 805306369 (estimated locally), maybe hot
;;    prev block 8, next block 9, flags: (NEW, REACHABLE, VISITED)
;;    pred:       9 [always]  count:603979777 (estimated locally) (FALLTHRU,DFS_BACK)
;;                8 [always]  count:201326592 (estimated locally) (FALLTHRU)
  # RANGE [-1, 2]
  # i_17 = PHI <i_16(9), i_11(8)>
  # .MEM_21 = PHI <.MEM_15(9), .MEM_9(D)(8)>
  # RANGE [1, 3] NONZERO 3
  _4 = i_17 + 1;
  # VUSE <.MEM_21>
  _5 = a_10(D)->chunksD.1943[i_17];
  # .MEM_15 = VDEF <.MEM_21>
  a_10(D)->chunksD.1943[_4] = _5;
  # RANGE [-1, 1]
  i_16 = i_17 + -1;
  if (i_16 >= _20)
    goto <bb 9>; [75.00%]
  else
    goto <bb 7>; [25.00%]
;;    succ:       9 [75.0% (guessed)]  count:603979777 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                7 [25.0% (guessed)]  count:201326592 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 9, loop depth 1, count 603979777 (estimated locally), maybe hot
;;    prev block 6, next block 7, flags: (NEW)
;;    pred:       6 [75.0% (guessed)]  count:603979777 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  goto <bb 6>; [100.00%]
;;    succ:       6 [always]  count:603979777 (estimated locally) (FALLTHRU,DFS_BACK)
Note this is an older regression, but I only marked it as 11/12 because -ftree-loop-distribute-patterns is now enabled at -O2 rather than only at -O3.
IIRC one of the issues is that we do not have any strategy for expanding memmove calls with a bounded size inline. On GIMPLE we only replace memmove calls with a constant power-of-two size (and properly aligned pointers, depending on the target).
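To illustrate the missing strategy (an illustrative sketch, not GCC's implementation — the helper name is hypothetical): when value-range information bounds the length at, say, 4 bytes, a memmove with a runtime length could be open-coded as a tiny backward byte copy instead of a PLT call. Copying from the end is what makes it correct for the overlapping dst > src case that the loop in foo() produces:

```c
#include <stddef.h>

/* Hypothetical open-coded expansion of memmove for dst >= src with a
   runtime length known (from range analysis) to be at most 4.  Copying
   from the last byte down ensures each overlapping byte is read before
   it is overwritten, matching memmove semantics for this direction. */
static void small_memmove_up(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    for (size_t i = n; i-- > 0;)
        dst[i] = src[i];
}
```

A compiler could emit this loop (or a small switch on n) directly, avoiding both the call overhead and the register spilling around it.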
GCC 11.2 is being released, retargeting bugs to GCC 11.3.
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
GCC 11 branch is being closed.