[Bug tree-optimization/69653] New: More CSE opportunities in address expressions

amker at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Wed Feb 3 13:57:00 GMT 2016


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69653

            Bug ID: 69653
           Summary: More CSE opportunities in address expressions
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amker at gcc dot gnu.org
  Target Milestone: ---

Given the simple test below:

#define MAX (1024)

void bar1 (int *, int *, int *, int *);
void bar2 (int *, int *, int *, int *);
int foo (unsigned int n)
{
  unsigned int i;
  int a0[MAX], a1[MAX], a2[MAX], a3[MAX];

  bar1 (a0, a1, a2, a3);
  for (i = 0; i < n; i++)
    {
      unsigned char j = (unsigned char) i;
      a0[j] = a0[j] + 1;
      a1[j + 1] = a1[j + 1] + 2;
      a2[j + 2] = a2[j + 2] + 1;
      a3[j + 3] = a3[j + 3] + 3;
    }
  bar2 (a0, a1, a2, a3);
  return 0;
}

The code below is generated on AArch64 with "-O2 -mcpu=cortex-a57":

foo:
        sub     sp, sp, #16384
        mov     x1, 12352
        stp     x29, x30, [sp, -64]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        add     x20, x29, x1
        mov     x1, 8256
        stp     x21, x22, [sp, 32]
        add     x21, x29, x1
        mov     x1, 4160
        add     x22, x29, x1
        add     x19, x29, 64
        str     x23, [sp, 48]
        mov     x3, x20
        mov     w23, w0
        mov     x2, x21
        mov     x1, x22
        mov     x0, x19
        bl      bar1
        cbz     w23, .L4
        mov     w4, 0
        .p2align 2
.L5:
        and     w0, w4, 255
        add     w4, w4, 1
        add     w3, w0, 1
        add     w2, w0, 2
        add     w1, w0, 3
        sxtw    x3, w3
        cmp     w23, w4
        sxtw    x0, w0
        sxtw    x2, w2
        ldr     w7, [x22, x3, lsl 2]
        sxtw    x1, w1
        ldr     w8, [x19, x0, lsl 2]
        ldr     w6, [x21, x2, lsl 2]
        add     w7, w7, 2
        ldr     w5, [x20, x1, lsl 2]
        add     w8, w8, 1
        str     w7, [x22, x3, lsl 2]
        add     w6, w6, 1
        str     w8, [x19, x0, lsl 2]
        add     w5, w5, 3
        str     w6, [x21, x2, lsl 2]
        str     w5, [x20, x1, lsl 2]
        bne     .L5
.L4:
        mov     x3, x20
        mov     x2, x21
        mov     x1, x22
        mov     x0, x19
        bl      bar2
        ldp     x19, x20, [sp, 16]
        mov     w0, 0
        ldp     x21, x22, [sp, 32]
        ldr     x23, [sp, 48]
        ldp     x29, x30, [sp], 64
        add     sp, sp, 16384
        ret

The loop is not optimal because CSE opportunities among address expressions are
missed.  All address expressions in this case are of the form "base + (i <<
scale) + const".  The scaling part "i << scale" is a common subexpression,
while the "base + const" part is an (sfp-related) loop invariant.
In my understanding, tree passes like ivopts/SLSR should be able to handle CSE
in addresses, and let the RTL optimizers decide whether the loop-invariant part
can be embedded in the addressing mode or hoisted out of the loop.

In this case, the index variable isn't a SCEV (the truncation to unsigned char
makes it wrap), so ivopts can't help.
In my understanding, SLSR is designed to do strength reduction on address
expressions; maybe it should be improved to handle addresses with different
bases.  Also, lowering all memory references to MEM_REF/TARGET_MEM_REF before
SLSR could make SLSR's job easier.

In the end, pseudo code like the below is wanted for the loop:
Loop-preheader:
  r0 = base0 + const0;
  r1 = base1 + const1;
  r2 = base2 + const2;
  r3 = base3 + const3;
loop-body:
  x = i << scale;
  ldr [r0 + x]
  ldr [r1 + x]
  ldr [r2 + x]
  ldr [r3 + x]
  //...
  str [r0 + x]
  str [r1 + x]
  str [r2 + x]
  str [r3 + x]
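For reference, the wanted transformation corresponds to the following hand-optimized C rewrite of the loop.  This is only a sketch to illustrate the CSE (the compiler should of course do this internally, with no source change); the names loop_orig/loop_cse are made up for the illustration:

```c
#include <string.h>

#define MAX (1024)

/* Reference loop, as written in the original test case.  */
static void loop_orig (int *a0, int *a1, int *a2, int *a3, unsigned int n)
{
  for (unsigned int i = 0; i < n; i++)
    {
      unsigned char j = (unsigned char) i;
      a0[j] = a0[j] + 1;
      a1[j + 1] = a1[j + 1] + 2;
      a2[j + 2] = a2[j + 2] + 1;
      a3[j + 3] = a3[j + 3] + 3;
    }
}

/* Hand-CSE'd loop: each "base + const" is hoisted into a loop-invariant
   pointer r0..r3, so all eight accesses share the single scaled index j.  */
static void loop_cse (int *a0, int *a1, int *a2, int *a3, unsigned int n)
{
  int *r0 = a0;        /* base0 + const0 (const0 == 0 here) */
  int *r1 = a1 + 1;    /* base1 + const1 */
  int *r2 = a2 + 2;    /* base2 + const2 */
  int *r3 = a3 + 3;    /* base3 + const3 */
  for (unsigned int i = 0; i < n; i++)
    {
      unsigned char j = (unsigned char) i;  /* shared "i << scale" part */
      r0[j] += 1;
      r1[j] += 2;
      r2[j] += 1;
      r3[j] += 3;
    }
}
```

With the invariant pointers hoisted, each memory access needs only one register-offset addressing mode "[rN, x, lsl 2]" with a single shared index, matching the pseudo code above.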
