[Bug tree-optimization/69653] New: More CSE opportunities in address expressions
amker at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Feb 3 13:57:00 GMT 2016
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69653
Bug ID: 69653
Summary: More CSE opportunities in address expressions
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amker at gcc dot gnu.org
Target Milestone: ---
Given below simple test:
#define MAX (1024)
void bar1 (int *, int *, int *, int *);
void bar2 (int *, int *, int *, int *);
int foo (unsigned int n)
{
unsigned int i;
int a0[MAX], a1[MAX], a2[MAX], a3[MAX];
bar1 (a0, a1, a2, a3);
for (i = 0; i < n; i++)
{
unsigned char j = (unsigned char) i;
a0[j] = a0[j] + 1;
a1[j + 1] = a1[j + 1] + 2;
a2[j + 2] = a2[j + 2] + 1;
a3[j + 3] = a3[j + 3] + 3;
}
bar2 (a0, a1, a2, a3);
return 0;
}
Below code is generated on AArch64 with option "-O2 -mcpu=cortex-a57":
foo:
sub sp, sp, #16384
mov x1, 12352
stp x29, x30, [sp, -64]!
add x29, sp, 0
stp x19, x20, [sp, 16]
add x20, x29, x1
mov x1, 8256
stp x21, x22, [sp, 32]
add x21, x29, x1
mov x1, 4160
add x22, x29, x1
add x19, x29, 64
str x23, [sp, 48]
mov x3, x20
mov w23, w0
mov x2, x21
mov x1, x22
mov x0, x19
bl bar1
cbz w23, .L4
mov w4, 0
.p2align 2
.L5:
and w0, w4, 255
add w4, w4, 1
add w3, w0, 1
add w2, w0, 2
add w1, w0, 3
sxtw x3, w3
cmp w23, w4
sxtw x0, w0
sxtw x2, w2
ldr w7, [x22, x3, lsl 2]
sxtw x1, w1
ldr w8, [x19, x0, lsl 2]
ldr w6, [x21, x2, lsl 2]
add w7, w7, 2
ldr w5, [x20, x1, lsl 2]
add w8, w8, 1
str w7, [x22, x3, lsl 2]
add w6, w6, 1
str w8, [x19, x0, lsl 2]
add w5, w5, 3
str w6, [x21, x2, lsl 2]
str w5, [x20, x1, lsl 2]
bne .L5
.L4:
mov x3, x20
mov x2, x21
mov x1, x22
mov x0, x19
bl bar2
ldp x19, x20, [sp, 16]
mov w0, 0
ldp x21, x22, [sp, 32]
ldr x23, [sp, 48]
ldp x29, x30, [sp], 64
add sp, sp, 16384
ret
The loop is not optimal because CSE opportunities among address expressions are
missed. All addr expressions in this case are in the form of "base + i <<
scale + const". The scaling part "i << scale" is common sub expression, while
"base + const" part is (sfp related) loop invariant.
In my understanding, tree passes like ivopt/slsr should be able to handle cse
in addresses, and let RTL optimizer decide if the loop-invariant part can be
embedded in addressing mode or be hoisted out of loop.
In this case, the index variable isn't a SCEV, so ivopt can't help.
In my understanding, SLSR are designed to do strength reduction in addr exprs,
maybe it should be improved to deal with addresses with different bases. Also
lowering all memory reference to MEM_REF/TARGE_MEM_REF before SLSR could make
SLSR's job easier.
After all, below pseudo code in loop is wanted:
Loop-preheader:
r0 = base0 + const0;
r1 = base1 + const1;
r2 = base2 + const2;
r3 = base3 + const3;
loop-body:
x = i << scale;
ldr [r0 + x]
ldr [r1 + x]
ldr [r2 + x]
ldr [r3 + x]
//...
str [r0 + x]
str [r1 + x]
str [r2 + x]
str [r3 + x]
More information about the Gcc-bugs
mailing list