Bug 55342 - [4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
Summary: [4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.8.0
Importance: P2 normal
Target Milestone: 5.0
Assignee: Vladimir Makarov
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2012-11-15 15:25 UTC by Yuri Rumyantsev
Modified: 2016-08-03 10:47 UTC
CC: 9 users

See Also:
Host:
Target: i?86-*-*
Build:
Known to work: 5.0
Known to fail: 4.9.4
Last reconfirmed: 2013-02-20 00:00:00


Attachments
modified test-case (300 bytes, text/plain)
2013-09-05 14:44 UTC, Yuri Rumyantsev
test-case to reproduce (263 bytes, text/plain)
2013-09-13 13:03 UTC, Yuri Rumyantsev

Description Yuri Rumyantsev 2012-11-15 15:25:20 UTC
For a simple test case we see a -15% performance regression with LRA on x86 in 32-bit mode. The test case is:

#define byte unsigned char
#define MIN(a, b) ((a) > (b) ? (b) : (a))

/* Convert an RGB image to CMYK: three input bytes become four output
   bytes per pixel.  */
void convert_image(byte *in, byte *out, int size) {
    int i;
    byte *read = in,
         *write = out;
    for (i = 0; i < size; i++) {
        byte r = *read++;
        byte g = *read++;
        byte b = *read++;
        byte c, m, y, k;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        if (c < m)
            k = MIN(c, y);
        else
            k = MIN(m, y);
        *write++ = c - k;
        *write++ = m - k;
        *write++ = y - k;
        *write++ = k;
    }
}
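
For reference, a typical way to inspect the generated loop (the source file name here is illustrative; the numbers above were measured with a 4.8-era development compiler):

gcc -O2 -m32 -S convert_image.c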

The essential part of the assembly (corresponding to the write part of the loop) is:

without LRA:
.L4:
	movl	%esi, %ecx
	addl	$4, %eax
	subl	%ecx, %ebx
	movzbl	3(%esp), %ecx
	movb	%bl, -4(%eax)
	movl	%esi, %ebx
	subl	%ebx, %edx
	movb	%dl, -2(%eax)
	subl	%ebx, %ecx
	movb	%cl, -3(%eax)
	cmpl	%ebp, 4(%esp)
	movb	%bl, -1(%eax)
	je	.L1

with LRA:

.L4:
	movl	%esi, %eax
	subl	%eax, %ebx
	movl	28(%esp), %eax
	movb	%bl, (%eax)
	movl	%esi, %eax
	subl	%eax, %ecx
	movl	28(%esp), %eax
	movb	%cl, 1(%eax)
	movl	%esi, %eax
	subl	%eax, %edx
	movl	28(%esp), %eax
	movb	%dl, 2(%eax)
	addl	$4, %eax
	movl	%eax, 28(%esp)
	movl	28(%esp), %ecx
	movl	%esi, %eax
	cmpl	%ebp, (%esp)
	movb	%al, -1(%ecx)
	je	.L1

I also wonder why an additional move is required to perform the subtraction:

    movl  %esi, %eax
    subl  %eax, %ebx

whereas a single instruction would suffice:
    subl  %esi, %ebx

I assume that this part is not related to LRA.
Comment 1 Vladimir Makarov 2012-11-17 17:59:41 UTC
Author: vmakarov
Date: Sat Nov 17 17:59:35 2012
New Revision: 193588

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193588
Log:
2012-11-17  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/55342
	* lra-assigns.c (spill_for): Try to allocate other reload pseudos
	before and after spilling.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-assigns.c
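
For readers following along, here is a rough sketch of the heuristic this log entry describes. Every identifier below is an illustrative stand-in, not the actual lra-assigns.c code:

#include <stdbool.h>

/* Illustrative stubs -- not the real lra-assigns.c internals.  */
static bool try_assign (int pseudo) { (void) pseudo; return false; }
static void spill_cheapest_live_pseudo (void) {}

/* Hypothetical shape of the change: when PSEUDO cannot get a hard
   register, first retry the other pending reload pseudos, then spill,
   then retry them again before giving PSEUDO another chance.  */
static bool
spill_for_sketch (int pseudo, const int *other_reloads, int n_others)
{
  for (int i = 0; i < n_others; i++)
    try_assign (other_reloads[i]);   /* try the others before spilling */

  spill_cheapest_live_pseudo ();     /* free up a hard register */

  for (int i = 0; i < n_others; i++)
    try_assign (other_reloads[i]);   /* ...and again after spilling */

  return try_assign (pseudo);
}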
Comment 2 Yuri Rumyantsev 2012-11-19 12:06:20 UTC
The patched compiler produces better binaries, but we still see a -6% performance degradation on Core i7. The main cause is that the LRA compiler spills the 'pure' byte 'g', whereas the old compiler spills 'm', which is the negation of 'g':

gcc without LRA (assembly at the head of the loop):

.L7:
	movzbl	1(%edi), %edx
	leal	3(%edi), %ebp
	movzbl	(%edi), %ebx
	movl	%ebp, %edi
	notl	%edx   // negation performed in a register
	movb	%dl, 3(%esp)

gcc with LRA:

.L7:
	movzbl	(%edi), %ebx
	leal	3(%edi), %ebp
	movzbl	1(%edi), %ecx
	movl	%ebp, %edi
	movzbl	-1(%ebp), %edx
	notl	%ebx
	notl	%ecx
	movb	%dl, (%esp)
	cmpb	%cl, %bl
	notb	(%esp) // negation performed in memory

I.e., we have a redundant load and store from/to the stack.

I assume this should also be fixed.
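
To make the difference concrete, here is a C-level analogue, illustrative only: 'spill' and the function names are invented for this sketch, with 'spill' modeling the stack slot, and for a byte 255 - g is the same as ~g, hence the not insns:

/* Where the negation of 'g' lands relative to the spill.  */
static unsigned char spill;

static void old_allocator (unsigned char g)
{
  unsigned char m = 255 - g;   /* notl %edx -- negate in a register   */
  spill = m;                   /* movb %dl, 3(%esp) -- spill result   */
}

static void lra_allocator (unsigned char g)
{
  spill = g;                   /* movb %dl, (%esp) -- spill raw byte  */
  spill = 255 - spill;         /* notb (%esp) -- load+negate+store    */
}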
Comment 3 Jakub Jelinek 2013-01-28 18:53:25 UTC
I'm downgrading this to P2, it is unfortunate we generate slightly worse code on this testcase, but it is more important how LRA vs. reload behaves on average on various benchmarks etc.  This doesn't mean this isn't important to look at, but I think it isn't a release blocker at this point.
Comment 4 Richard Biener 2013-02-20 14:09:59 UTC
At least it seems to be confirmed.
Comment 5 Jakub Jelinek 2013-03-22 14:44:31 UTC
GCC 4.8.0 is being released, adjusting target milestone.
Comment 6 Jakub Jelinek 2013-05-31 10:58:38 UTC
GCC 4.8.1 has been released.
Comment 7 Vladimir Makarov 2013-06-06 15:20:02 UTC
(In reply to Yuri Rumyantsev from comment #2)
> The patched compiler produces better binaries, but we still see a -6%
> performance degradation on Core i7. The main cause is that the LRA
> compiler spills the 'pure' byte 'g', whereas the old compiler spills
> 'm', which is the negation of 'g':
> [assembly snipped; see comment #2]
> I.e., we have a redundant load and store from/to the stack.
> 
> I assume this should also be fixed.

Fixing the notl problem requires implementing new functionality in LRA: making reloads that stay if the reload pseudo got a hard register and was inherited (in that case keeping the reload is profitable).  Otherwise the current code should be generated (the reloads and reload pseudos should be removed and the old code restored).  I've started work on this, but it will not be fixed quickly, as implementing the new functionality is not a trivial task.
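
A hypothetical outline of that decision, purely to illustrate the idea; none of these identifiers exist in LRA:

#include <stdbool.h>

/* Illustrative stubs.  */
static bool got_hard_reg_p (int p)    { (void) p; return true; }
static bool inherited_p (int p)       { (void) p; return true; }
static void keep_reload_insns (int p) { (void) p; }
static void undo_reload (int p)       { (void) p; }

/* Hypothetical decision: a reload stays only when its reload pseudo
   got a hard register and was inherited; otherwise the reload insns
   are removed and the original code restored.  */
static void
finalize_reload (int reload_pseudo)
{
  if (got_hard_reg_p (reload_pseudo) && inherited_p (reload_pseudo))
    keep_reload_insns (reload_pseudo);
  else
    undo_reload (reload_pseudo);
}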
Comment 8 Yuri Rumyantsev 2013-09-05 14:44:09 UTC
Created attachment 30751 [details]
modified test-case

Modified test-case to reproduce sub-optimal register allocation.
Comment 9 Yuri Rumyantsev 2013-09-05 14:51:33 UTC
The issue still exists in the 4.9 compiler, and we see another 30% degradation after the r202165 fix. It can be reproduced with the attached modified test case and any 4.9 compiler; the code produced for the inner loop looks like:

.L8:
	movl	%esi, %ecx
	movl	%esi, %edi
	movzbl	3(%esp), %edx
	cmpb	%cl, %dl
	movl	%edx, %ecx
	cmovbe	%ecx, %edi
.L4:
	movl	%esi, %edx
	movl	%edi, %ecx
	subl	%ecx, %edx
	movl	28(%esp), %ecx
	movl	28(%esp), %esi
	addl	$4, 28(%esp)
	movb	%dl, (%ecx)
	movl	%edi, %ecx
	subl	%ecx, %ebx
	movl	%edi, %edx
	movzbl	3(%esp), %ecx
	movb	%bl, 1(%esi)
	subl	%edx, %ecx
	movl	%edi, %ebx
	movb	%cl, 2(%esi)
	movl	28(%esp), %esi
	cmpl	%ebp, %eax
	movb	%bl, -1(%esi)
	je	.L1
.L5:
	movzbl	(%eax), %esi
	leal	3(%eax), %eax
	movzbl	-2(%eax), %ebx
	notl	%esi
	notl	%ebx
	movl	%esi, %edx
	movzbl	-1(%eax), %ecx
	cmpb	%bl, %dl
	movb	%cl, 3(%esp)
	notb	3(%esp)
	jb	.L8
	movzbl	3(%esp), %edx
	movl	%ebx, %edi
	cmpb	%bl, %dl
	cmovbe	%edx, %edi
	jmp	.L4

You can see that (1) there are two additional moves at the top of the blocks marked .L4 and .L8, and (2) there are redundant spills/fills of the 'write' base in the block marked .L4 (28(%esp)).
To reproduce, it is sufficient to compile the modified test case with the '-m32 -march=atom' options.
Comment 10 Yuri Rumyantsev 2013-09-13 13:00:23 UTC
After fix r202468 the assembly looks slightly better, but we hit another RA inefficiency. It can be illustrated with the attached test (t1.c), compiled with the options "-march=atom -mtune=atom -m32 -O2": the upper bound of the loop check is kept in a register, while the base register for "write" is kept on the stack:

.L8:
	movzbl	3(%esp), %edx
	movl	%esi, %ecx
	cmpb	%cl, %dl
	movl	%esi, %edi
	cmovbe	%edx, %edi
.L4:
	movl	%esi, %edx
	movl	28(%esp), %esi  <-- why is 'write' on the stack?
	movl	%edi, %ecx
	addl	$4, 28(%esp)  <-- 'write' is incremented on the stack
	subl	%ecx, %edx
	subl	%ecx, %ebx
	movzbl	3(%esp), %ecx
	movb	%dl, (%esi)
	movl	%edi, %edx
	subl	%edx, %ecx
	movb	%bl, 1(%esi)
	movb	%cl, 2(%esi)
	movl	28(%esp), %esi
	cmpl	%ebp, %eax  <-- why is the upper bound in a register?
	movb	%dl, -1(%esi)
	je	.L1
.L5:
	movzbl	(%eax), %esi
	leal	3(%eax), %eax
	movzbl	-2(%eax), %ebx
	notl	%esi
	notl	%ebx
	movl	%esi, %edx
	movzbl	-1(%eax), %ecx
	cmpb	%bl, %dl
	notl	%ecx
	movb	%cl, 3(%esp)
	jb	.L8
	movzbl	3(%esp), %edx
	movl	%ebx, %edi
	cmpb	%bl, %dl
	cmovbe	%edx, %edi
	jmp	.L4

Is something wrong in the Atom cost model? In any case, I assume that keeping the upper bound on the stack is much cheaper than loading the base from the stack and incrementing it there.
Comment 11 Yuri Rumyantsev 2013-09-13 13:03:04 UTC
Created attachment 30816 [details]
test-case to reproduce

t1.c must be compiled on x86 with the options:

-O2 -march=atom -mtune=atom -mfpmath=sse -m32
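
That is, as a full (illustrative) command line, with -S to emit assembly:

gcc -O2 -march=atom -mtune=atom -mfpmath=sse -m32 -S t1.c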
Comment 12 Jakub Jelinek 2013-10-16 09:49:19 UTC
GCC 4.8.2 has been released.
Comment 13 Richard Biener 2014-05-22 09:04:09 UTC
GCC 4.8.3 is being released, adjusting target milestone.
Comment 14 Jakub Jelinek 2014-12-19 13:33:24 UTC
GCC 4.8.4 has been released.
Comment 15 Jeffrey A. Law 2015-02-12 07:19:13 UTC
I've examined the various testcases and the complaints about the poor register allocation in this BZ with a trunk compiler.

I'm happy to report that I'm seeing none of the issues raised in this BZ.  

For c#0 (store-back part of the loop):
.L5:
        movl    %edi, %ecx
        addl    $4, %esi
        subl    %ecx, %eax
        subl    %ecx, %edx
        movzbl  3(%esp), %ecx
        movb    %al, -3(%esi)
        movl    %edi, %eax
        movb    %dl, -4(%esi)
        subl    %eax, %ecx
        movb    %cl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1

In c#2, the negation sequence is pointed out.  We now get:

.L9:
        movzbl  (%ebx), %edx
        movzbl  1(%ebx), %eax
        addl    $3, %ebx
        movzbl  -1(%ebx), %ecx
        notl    %edx
        notl    %eax
        notl    %ecx
        cmpb    %al, %dl
        movb    %cl, 3(%esp)
        jb      .L13
        cmpb    3(%esp), %al
        movzbl  %al, %edi
        jbe     .L5
        movzbl  3(%esp), %edi
        jmp     .L5

For the 1st modified testcase (-O2 -mcpu=atom -m32):

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebx, %ebp
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Then in c#10 (t1 testcase):

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4


Across the board we're not seeing objects spilled into the stack.  The code looks quite tight to me.

Clearing the regression marker for GCC 5.  I didn't do any bisection work to identify which changes fixed things.
Comment 16 Jakub Jelinek 2015-02-12 13:47:47 UTC
The #c10 issue went away with r204212, I believe.
Comment 17 Richard Biener 2015-06-23 08:15:52 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 18 Jakub Jelinek 2015-06-26 19:57:30 UTC
GCC 4.9.3 has been released.
Comment 19 Richard Biener 2016-08-03 10:47:41 UTC
Fixed in GCC 5+