[Bug tree-optimization/60577] New: inefficient FDO instrumentation code

Wed Mar 19 03:02:00 GMT 2014

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60577

            Bug ID: 60577
           Summary: inefficient FDO instrumentation code
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: carrot at google dot com

This is actually a regression caused by r175916.

Compile the following code with the options -O2 -fno-strict-aliasing
-fprofile-generate:

struct thread_param
{
  long* buf;
  long iterations;
  long accesses;
} param;

void access_buf(struct thread_param* p)
{
  long i,j;
  long iterations = p->iterations;
  long accesses = p->accesses;
  for (i=0; i<iterations; i++)
  {
    long* pbuf = p->buf;
    for (j=0; j<accesses; j++)
      pbuf[j] += 1;
  }
}

Trunk GCC generates the following code for the innermost loop:

.L9:
        addq    $1, __gcov0.access_buf(%rip)
        addq    $1, (%rax)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L9

The FDO counter in memory is incremented in each iteration.

GCC at revision r175915 generates the following code for the innermost loop:

        movq    .LPBX1(%rip), %rsi
        ...
.L4:
        addq    $1, (%rax)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L4
        leaq    1(%rsi,%r9), %rsi
        ...
        movq    %rsi, .LPBX1(%rip)

The FDO counter adds no overhead to the innermost loop: it is loaded into a
register before the loop and stored back to memory after it.

GCC at revision r175916 generates the following code for the innermost loop:

        movq    .LPBX1(%rip), %rcx
        xorl    %eax, %eax
        leaq    1(%rcx), %r8
        .p2align 4,,10
        .p2align 3
.L4:
        leaq    (%r8,%rax), %rcx
        movq    %rcx, .LPBX1(%rip)
        addq    $1, (%rdx,%rax,8)
        addq    $1, %rax
        cmpq    %rsi, %rax
        jne     .L4

The FDO counter is incremented and written to memory in each iteration.
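
For illustration only, the two shapes of the instrumented inner loop can be
sketched at source level. This is not what GCC emits internally. The names
count_in_memory, count_in_register and shapes_agree are hypothetical, and the
counter argument stands in for the compiler-generated counter
(__gcov0.access_buf / .LPBX1 in the listings above). The volatile qualifier is
used only to force a memory update per iteration in the first variant. The
second variant is equivalent only under the assumption that pbuf does not
alias the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the trunk and r175916 code: the counter is updated in
   memory on every inner-loop iteration. */
static void count_in_memory(long *pbuf, long accesses, volatile long *counter)
{
    for (long j = 0; j < accesses; j++) {
        *counter += 1;          /* memory update in each iteration */
        pbuf[j] += 1;
    }
}

/* Shape of the r175915 code: the counter lives in a local (register)
   during the loop and is stored once afterwards.
   Assumption: pbuf does not alias *counter. */
static void count_in_register(long *pbuf, long accesses, long *counter)
{
    long local = *counter;      /* one load before the loop */
    for (long j = 0; j < accesses; j++) {
        local += 1;
        pbuf[j] += 1;
    }
    *counter = local;           /* one store after the loop */
}

/* Returns 1 when both shapes leave the same counter value and the
   same buffer contents for up to 8 accesses. */
int shapes_agree(long accesses)
{
    long a[8] = {0}, b[8] = {0};
    volatile long c1 = 0;
    long c2 = 0;

    assert(accesses >= 0 && accesses <= 8);
    count_in_memory(a, accesses, &c1);
    count_in_register(b, accesses, &c2);

    for (size_t k = 0; k < 8; k++)
        if (a[k] != b[k])
            return 0;
    return c1 == c2 && c2 == accesses;
}
```

Both variants produce the same final counter value; they differ only in how
many memory accesses to the counter occur inside the loop.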