This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)
- From: "marcin.krotkiewski at gmail dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 05 Feb 2014 22:41:08 +0000
- Subject: [Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)
- Auto-submitted: auto-generated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Bug ID: 60086
Summary: suboptimal asm generated for a loop (store/load false aliasing)
Product: gcc
Version: 4.7.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: marcin.krotkiewski at gmail dot com
Created attachment 32060
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles
Hello,
I am seeing suboptimal performance for the following loop when compiled
with gcc 4.7.3 (also with 4.4.7; Ubuntu; full test code attached):
    for (i = 0; i < NSIZE; i++) {
        a[i] += b[i];
        c[i] += d[i];
    }
The arrays are dynamically allocated and aligned to a page boundary, and
are declared with __restrict__ and __attribute__((aligned(32))). I am
running on an Intel i7-2620M (Sandy Bridge).
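
The attached test code is not reproduced in this message. The following is a minimal sketch of the setup described above, assuming POSIX posix_memalign for the page-aligned allocation and a 4096-byte page. The names alloc_page_aligned, add_pairs and ASSUMED_PAGE_SIZE are hypothetical and are not taken from the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 4 KiB pages; a real program could query sysconf(_SC_PAGESIZE). */
#define ASSUMED_PAGE_SIZE 4096

/* Hypothetical helper: allocate n doubles starting at a page boundary. */
double *alloc_page_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, ASSUMED_PAGE_SIZE, n * sizeof(double)) != 0)
        return NULL;
    /* Sanity check: the returned address is a multiple of the page size. */
    assert(((uintptr_t)p % ASSUMED_PAGE_SIZE) == 0);
    return (double *)p;
}

/* The loop from the report; __restrict__ promises the arrays do not overlap. */
void add_pairs(double *__restrict__ a, const double *__restrict__ b,
               double *__restrict__ c, const double *__restrict__ d,
               size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] += b[i];
        c[i] += d[i];
    }
}
```

With four arrays obtained from alloc_page_aligned, element i of every array sits at the same offset within its page, which is the layout the rest of this report is concerned with.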
The problem is IMHO related to '4k aliasing'. It happens in the most
common case of a/b/c/d each starting at a page boundary (e.g., the
natural result of malloc). To demonstrate, here is the assembly generated
with 'gcc -mtune=native -mavx -O3':
.L8:
    vmovapd (%rdx,%rdi), %ymm0          #1 load a
    addq    $1, %r8                     #2
    vaddpd  (%rcx,%rdi), %ymm0, %ymm0   #3 load b and add
    vmovapd %ymm0, (%rdx,%rdi)          #4 store a
    vmovapd (%rax,%rdi), %ymm0          #5 load c
    vaddpd  (%rsi,%rdi), %ymm0, %ymm0   #6 load d and add
    vmovapd %ymm0, (%rax,%rdi)          #7 store c
    addq    $32, %rdi                   #8
    cmpq    %r8, %r12                   #9
    ja      .L8                         #10
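The 4k aliasing is triggered by instructions #4 and #5: the store of the
result to array a is immediately followed by a load from the second pair
of arrays (c, d). Because all arrays start at a page boundary, the two
addresses have the same offset within a 4 KiB page. From my tests this
seems to be the default behavior for both the AVX and SSE2 instruction
sets, and for both vectorized and non-vectorized code.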
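
To make the address arithmetic concrete, here is a simplified model of the condition. It assumes the usual description of 4k aliasing, in which a load is treated as possibly dependent on an earlier in-flight store when the low 12 bits of the two addresses match. The function names low12_match and elem_addr are hypothetical, and the model ignores access width and partial overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model (an assumption, not the exact hardware rule):
   nonzero when both addresses have the same offset within a 4 KiB page. */
int low12_match(uintptr_t store_addr, uintptr_t load_addr)
{
    return ((store_addr ^ load_addr) & 0xFFFu) == 0;
}

/* Address of element i in an array of doubles that starts at base. */
uintptr_t elem_addr(uintptr_t base, size_t i)
{
    return base + i * sizeof(double);
}
```

For two page-aligned bases, low12_match(elem_addr(a, i), elem_addr(c, i)) holds for every i, so in this model the store in #4 and the load in #5 collide on every iteration. Shifting one base by an amount that is not a multiple of 4096 makes the check fail.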
It is easy to fix the problem by placing the two writes together, at the
end of the iteration, e.g.:
.L8:
    vmovapd (%rdx,%rdi), %ymm1          #1
    addq    $1, %r8                     #2
    vaddpd  (%rcx,%rdi), %ymm1, %ymm1   #3
    vmovapd (%rax,%rdi), %ymm0          #4
    vaddpd  (%rsi,%rdi), %ymm0, %ymm0   #5
    vmovapd %ymm1, (%rdx,%rdi)          #6
    vmovapd %ymm0, (%rax,%rdi)          #7
    addq    $32, %rdi                   #8
    cmpq    %r8, %r12                   #9
    ja      .L8                         #10
In this case both stores happen after all the loads. The above code is
(almost) what ICC generates for this case. For problem sizes small
enough to fit in the L1 cache, the speedup is roughly 50%.
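
At the source level, the same ordering can be written by computing both sums into temporaries before storing either. This is only a sketch: the function name add_pairs_loads_first is hypothetical, and the compiler's scheduler is free to reorder the loads and stores again, so there is no guarantee that it produces the second instruction sequence above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of the loop: all loads first, then both stores. */
void add_pairs_loads_first(double *__restrict__ a, const double *__restrict__ b,
                           double *__restrict__ c, const double *__restrict__ d,
                           size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Both sums are computed before anything is written back. */
        double ta = a[i] + b[i];
        double tc = c[i] + d[i];
        a[i] = ta;
        c[i] = tc;
    }
}
```

The results are identical to those of the original loop; only the order of memory operations within an iteration differs, and checking the generated assembly is the only way to see what the compiler actually emitted.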