Created attachment 51168 [details]
simplified example reproducing problem

Hi,

I found -ftree-loop-distribute-patterns to be far too aggressive in replacing code, leading to increased code size and substantial slowdowns (12% in the program in which I just hit this). The code size increase and slowdown are partially caused by the function call itself, and partially by the spilling necessary to make that function call. This is worsened further by the PLT call to memmove().

A very simplified example (also attached) is this:

typedef struct node
{
    unsigned char chunks[4];
    unsigned char count;
} node;

void foo(node *a, unsigned char newchunk, unsigned char off)
{
    if (a->count > 3)
        __builtin_unreachable();

    for (int i = a->count - 1; i >= off; i--)
        a->chunks[i + 1] = a->chunks[i];
    a->chunks[off] = newchunk;
}

which with `-O2 -fPIC` boils down to:

foo(node*, unsigned char, unsigned char):
        pushq   %r12
        movl    %edx, %r8d
        movl    %esi, %r12d
        pushq   %rbp
        movq    %rdi, %rbp
        pushq   %rbx
        movzbl  4(%rdi), %ecx
        movzbl  %r8b, %ebx
        leal    -1(%rcx), %edx
        cmpl    %ebx, %edx
        jl      .L2
        movl    %ecx, %eax
        movslq  %edx, %rsi
        subl    %ebx, %ecx
        subl    $1, %ecx
        movq    %rsi, %rdx
        subq    %rcx, %rdx
        leaq    1(%rcx), %r8
        leaq    (%rdi,%rdx), %rsi
        movzbl  %al, %edi
        movq    %r8, %rdx
        movq    %rdi, %rax
        subq    %rcx, %rax
        leaq    0(%rbp,%rax), %rdi
        call    memmove@PLT
.L2:
        movb    %r12b, 0(%rbp,%rbx)
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret

compare to `-O2 -fPIC -fno-tree-loop-distribute-patterns`:

foo(node*, unsigned char, unsigned char):
        movzbl  4(%rdi), %eax
        movzbl  %dl, %edx
        subl    $1, %eax
        cmpl    %edx, %eax
        jl      .L2
        cltq
.L3:
        movzbl  (%rdi,%rax), %ecx
        movb    %cl, 1(%rdi,%rax)
        subq    $1, %rax
        cmpl    %eax, %edx
        jle     .L3
.L2:
        movb    %sil, (%rdi,%rdx)
        ret

which I think makes the problem apparent.

Regards,

Andres Freund
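(Not part of the original report — a hedged workaround sketch.) Until the transform is tuned, the pattern replacement can be suppressed for just this function with GCC's `optimize` function attribute, whose string argument mirrors the command-line flag named above; the rest of the translation unit keeps its normal -O2 pipeline:

```c
#include <assert.h>

typedef struct node
{
    unsigned char chunks[4];
    unsigned char count;
} node;

/* Per-function workaround sketch: disable loop-pattern distribution so the
   short byte-copy loop is kept instead of being turned into memmove@PLT.
   The attribute string corresponds to -fno-tree-loop-distribute-patterns. */
__attribute__((optimize("no-tree-loop-distribute-patterns")))
void foo(node *a, unsigned char newchunk, unsigned char off)
{
    if (a->count > 3)
        __builtin_unreachable();

    /* Shift chunks[off..count-1] up by one, then insert the new chunk. */
    for (int i = a->count - 1; i >= off; i--)
        a->chunks[i + 1] = a->chunks[i];
    a->chunks[off] = newchunk;
}
```

The attribute only affects code generation, not semantics, so it is safe to drop again once the underlying cost model is fixed.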
Confirmed. From the ranges it should be possible to figure out that the max size is 4. In fact GCC really does know the range (note I think it is really [0, 2], but that is a different issue), and the loop bound is at most 4:

;;   basic block 6, loop depth 1, count 805306369 (estimated locally), maybe hot
;;    prev block 8, next block 9, flags: (NEW, REACHABLE, VISITED)
;;    pred:       9 [always]  count:603979777 (estimated locally) (FALLTHRU,DFS_BACK)
;;                8 [always]  count:201326592 (estimated locally) (FALLTHRU)
  # RANGE [-1, 2]
  # i_17 = PHI <i_16(9), i_11(8)>
  # .MEM_21 = PHI <.MEM_15(9), .MEM_9(D)(8)>
  # RANGE [1, 3] NONZERO 3
  _4 = i_17 + 1;
  # VUSE <.MEM_21>
  _5 = a_10(D)->chunksD.1943[i_17];
  # .MEM_15 = VDEF <.MEM_21>
  a_10(D)->chunksD.1943[_4] = _5;
  # RANGE [-1, 1]
  i_16 = i_17 + -1;
  if (i_16 >= _20)
    goto <bb 9>; [75.00%]
  else
    goto <bb 7>; [25.00%]
;;    succ:       9 [75.0% (guessed)]  count:603979777 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                7 [25.0% (guessed)]  count:201326592 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 9, loop depth 1, count 603979777 (estimated locally), maybe hot
;;    prev block 6, next block 7, flags: (NEW)
;;    pred:       6 [75.0% (guessed)]  count:603979777 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  goto <bb 6>; [100.00%]
;;    succ:       6 [always]  count:603979777 (estimated locally) (FALLTHRU,DFS_BACK)
Note this is an older regression, but I only marked it as 11/12 because -ftree-loop-distribute-patterns is now enabled at -O2 rather than only at -O3.
IIRC one of the issues is that we do not have any strategy for expanding memmove calls with a bounded size inline. On GIMPLE we only replace memmove calls with a constant power-of-two size (and properly aligned pointers, depending on the target).
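To illustrate the missing strategy (an illustrative sketch, not GCC's implementation — the helper name is hypothetical): when value-range information bounds the length at, say, 4 bytes, a memmove with a runtime length could be open-coded as a tiny backward byte copy instead of a PLT call. Copying from the end is what makes it correct for the overlapping dst > src case that the loop in foo() produces:

```c
#include <stddef.h>

/* Hypothetical open-coded expansion of memmove for dst >= src with a
   runtime length known (from range analysis) to be at most 4.  Copying
   from the last byte down ensures each overlapping byte is read before
   it is overwritten, matching memmove semantics for this direction. */
static void small_memmove_up(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    for (size_t i = n; i-- > 0;)
        dst[i] = src[i];
}
```

A compiler could emit this loop (or a small switch on n) directly, avoiding both the call overhead and the register spilling around it.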
GCC 11.2 is being released, retargeting bugs to GCC 11.3.
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
GCC 11 branch is being closed.