For a simple test-case we got a -15% regression with LRA on x86 in 32-bit mode. The test-case is:

#define byte unsigned char
#define MIN(a, b) ((a) > (b) ? (b) : (a))

void convert_image(byte *in, byte *out, int size)
{
  int i;
  byte *read = in, *write = out;
  for (i = 0; i < size; i++)
    {
      byte r = *read++;
      byte g = *read++;
      byte b = *read++;
      byte c, m, y, k, tmp;
      c = 255 - r;
      m = 255 - g;
      y = 255 - b;
      if (c < m)
        k = MIN (c, y);
      else
        k = MIN (m, y);
      *write++ = c - k;
      *write++ = m - k;
      *write++ = y - k;
      *write++ = k;
    }
}

The essential part of the assembly (it corresponds to the write-part of the loop) is:

without LRA

.L4:
        movl    %esi, %ecx
        addl    $4, %eax
        subl    %ecx, %ebx
        movzbl  3(%esp), %ecx
        movb    %bl, -4(%eax)
        movl    %esi, %ebx
        subl    %ebx, %edx
        movb    %dl, -2(%eax)
        subl    %ebx, %ecx
        movb    %cl, -3(%eax)
        cmpl    %ebp, 4(%esp)
        movb    %bl, -1(%eax)
        je      .L1

with LRA

.L4:
        movl    %esi, %eax
        subl    %eax, %ebx
        movl    28(%esp), %eax
        movb    %bl, (%eax)
        movl    %esi, %eax
        subl    %eax, %ecx
        movl    28(%esp), %eax
        movb    %cl, 1(%eax)
        movl    %esi, %eax
        subl    %eax, %edx
        movl    28(%esp), %eax
        movb    %dl, 2(%eax)
        addl    $4, %eax
        movl    %eax, 28(%esp)
        movl    28(%esp), %ecx
        movl    %esi, %eax
        cmpl    %ebp, (%esp)
        movb    %al, -1(%ecx)
        je      .L1

I also wonder why an additional move is required to perform the subtraction:

        movl    %esi, %eax
        subl    %eax, %ebx

whereas only one instruction is required:

        subl    %esi, %ebx

I assume that this part is not related to LRA.
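For quick reference, the store-back fragment that both assembly dumps above implement is the tail of the loop body, repeated here as a standalone sketch (the wrapper function and its name are only illustrative, not part of the attached test-case):

/* Illustrative sketch only: the write-part of the loop, pulled out of
   convert_image for readability.  The wrapper itself is hypothetical. */
static void
store_pixel (unsigned char *write, unsigned char c,
             unsigned char m, unsigned char y, unsigned char k)
{
  write[0] = c - k;
  write[1] = m - k;
  write[2] = y - k;
  write[3] = k;   /* four byte stores per pixel, matching the four movb's */
}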
Author: vmakarov
Date: Sat Nov 17 17:59:35 2012
New Revision: 193588

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193588
Log:
2012-11-17  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/55342
	* lra-assigns.c (spill_for): Try to allocate other reload
	pseudos before and after spilling.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-assigns.c
The patched compiler produces better binaries, but we still have a -6% performance degradation on Core i7. The main cause is that the LRA compiler generates a spill of the 'pure' byte 'g', whereas the old compiler generates a spill of 'm', which is the negation of 'g':

gcc without LRA (assembly of the head of the loop)

.L7:
        movzbl  1(%edi), %edx
        leal    3(%edi), %ebp
        movzbl  (%edi), %ebx
        movl    %ebp, %edi
        notl    %edx            // perform negation on register
        movb    %dl, 3(%esp)

gcc with LRA

.L7:
        movzbl  (%edi), %ebx
        leal    3(%edi), %ebp
        movzbl  1(%edi), %ecx
        movl    %ebp, %edi
        movzbl  -1(%ebp), %edx
        notl    %ebx
        notl    %ecx
        movb    %dl, (%esp)
        cmpb    %cl, %bl
        notb    (%esp)          // perform negation in memory

i.e. we have a redundant load and store from/to the stack.

I assume that this should be fixed as well.
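For context, a minimal sketch (not part of the original report) of where the negation comes from: in the test-case 'g' is the raw byte read from the image and 'm' is its complement, and for an 8-bit value 255 - g is the same as a bitwise NOT, which is why the compilers use notl/notb:

  unsigned char g = *read++;    /* the 'pure' byte g                       */
  unsigned char m = 255 - g;    /* equals (unsigned char) ~g, hence notl   */

So the old allocator spills the already-negated value, while LRA spills the raw byte and then has to apply notb to the spill slot in memory.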
I'm downgrading this to P2. It is unfortunate that we generate slightly worse code on this testcase, but it is more important how LRA vs. reload behaves on average across various benchmarks etc. This doesn't mean it isn't important to look at, but I don't think it is a release blocker at this point.
At least it seems to be confirmed.
GCC 4.8.0 is being released, adjusting target milestone.
GCC 4.8.1 has been released.
(In reply to Yuri Rumyantsev from comment #2)
> The patched compiler produces better binaries, but we still have a -6%
> performance degradation on Core i7. The main cause is that the LRA compiler
> generates a spill of the 'pure' byte 'g', whereas the old compiler generates
> a spill of 'm', which is the negation of 'g':
> 
> gcc without LRA (assembly of the head of the loop)
> 
> .L7:
>         movzbl  1(%edi), %edx
>         leal    3(%edi), %ebp
>         movzbl  (%edi), %ebx
>         movl    %ebp, %edi
>         notl    %edx            // perform negation on register
>         movb    %dl, 3(%esp)
> 
> gcc with LRA
> 
> .L7:
>         movzbl  (%edi), %ebx
>         leal    3(%edi), %ebp
>         movzbl  1(%edi), %ecx
>         movl    %ebp, %edi
>         movzbl  -1(%ebp), %edx
>         notl    %ebx
>         notl    %ecx
>         movb    %dl, (%esp)
>         cmpb    %cl, %bl
>         notb    (%esp)          // perform negation in memory
> 
> i.e. we have a redundant load and store from/to the stack.
> 
> I assume that this should be fixed as well.

Fixing the notl problem requires implementing new functionality in LRA: generating reloads that are kept only if the reload pseudo got a hard register and was inherited (in that case it is profitable); otherwise the current code should be generated, i.e. the reloads and reload pseudos should be removed and the old code restored. I've started working on this, but it will not be fixed quickly, as implementing the new functionality is not a trivial task.
Created attachment 30751 [details]
modified test-case

Modified test-case to reproduce the sub-optimal register allocation.
The issue still exists in the 4.9 compiler, but we got another 30% degradation after the fix in r202165. It can be reproduced with any 4.9 compiler using the attached modified test-case; the code produced for the inner loop looks like:

.L8:
        movl    %esi, %ecx
        movl    %esi, %edi
        movzbl  3(%esp), %edx
        cmpb    %cl, %dl
        movl    %edx, %ecx
        cmovbe  %ecx, %edi
.L4:
        movl    %esi, %edx
        movl    %edi, %ecx
        subl    %ecx, %edx
        movl    28(%esp), %ecx
        movl    28(%esp), %esi
        addl    $4, 28(%esp)
        movb    %dl, (%ecx)
        movl    %edi, %ecx
        subl    %ecx, %ebx
        movl    %edi, %edx
        movzbl  3(%esp), %ecx
        movb    %bl, 1(%esi)
        subl    %edx, %ecx
        movl    %edi, %ebx
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax
        movb    %bl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        movb    %cl, 3(%esp)
        notb    3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

and you can see that (1) there are two additional moves at the top of the blocks marked with .L4 and .L8; (2) there are redundant spills/fills of the 'write' base in the block marked with .L4 (28(%esp)). To reproduce, it is sufficient to compile the modified test-case with the '-m32 -march=atom' options.
After the fix in rev. 202468 the assembly looks slightly better, but we hit another RA inefficiency, which can be illustrated with the attached test (t1.c) compiled with the options "-march=atom -mtune=atom -m32 -O2": the upper bound of the loop check is kept in a register, while the base register for "write" is kept on the stack:

.L8:
        movzbl  3(%esp), %edx
        movl    %esi, %ecx
        cmpb    %cl, %dl
        movl    %esi, %edi
        cmovbe  %edx, %edi
.L4:
        movl    %esi, %edx
        movl    28(%esp), %esi   <-- why is 'write' on the stack?
        movl    %edi, %ecx
        addl    $4, 28(%esp)     <-- 'write' is incremented on the stack
        subl    %ecx, %edx
        subl    %ecx, %ebx
        movzbl  3(%esp), %ecx
        movb    %dl, (%esi)
        movl    %edi, %edx
        subl    %edx, %ecx
        movb    %bl, 1(%esi)
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax       <-- why is the upper bound in a register?
        movb    %dl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        notl    %ecx
        movb    %cl, 3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Is something wrong in the Atom cost model? In any case, I assume that keeping the upper bound on the stack is much cheaper than loading the 'write' base from the stack and incrementing it there.
Created attachment 30816 [details]
test-case to reproduce

t1.c must be compiled on x86 with the options: -O2 -march=atom -mtune=atom -mfpmath=sse -m32
GCC 4.8.2 has been released.
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
I've examined the various testcases and the complaints about the poor register allocation in this BZ with a trunk compiler. I'm happy to report that I'm seeing none of the issues raised in this BZ.

For c#0 (store-back part of the loop):

.L5:
        movl    %edi, %ecx
        addl    $4, %esi
        subl    %ecx, %eax
        subl    %ecx, %edx
        movzbl  3(%esp), %ecx
        movb    %al, -3(%esi)
        movl    %edi, %eax
        movb    %dl, -4(%esi)
        subl    %eax, %ecx
        movb    %cl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1

In c#2, the negation sequence is pointed out. We now get:

.L9:
        movzbl  (%ebx), %edx
        movzbl  1(%ebx), %eax
        addl    $3, %ebx
        movzbl  -1(%ebx), %ecx
        notl    %edx
        notl    %eax
        notl    %ecx
        cmpb    %al, %dl
        movb    %cl, 3(%esp)
        jb      .L13
        cmpb    3(%esp), %al
        movzbl  %al, %edi
        jbe     .L5
        movzbl  3(%esp), %edi
        jmp     .L5

For the 1st modified testcase with -O2 -mcpu=atom -m32:

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebx, %ebp
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Then in c#10 (the t1 testcase):

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Across the board we're not seeing objects spilled to the stack. The code looks quite tight to me.

Clearing the regression marker for GCC 5. I didn't do any bisection work to identify which changes fixed things.
The c#10 issue went away with r204212, I believe.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
Fixed in GCC 5+