In testing PR46235 I noticed some minor inefficiency in the usage of an extra register. The C code is:

int foo(int a, int x, int y) {
    if (a & (16))
        return a;
    return 1;
}

Which produces the asm:

        movl    %edi, %eax
        movl    $1, %edx
        testb   $16, %al
        cmove   %edx, %eax
        ret

The above code could have been further optimized to remove the usage of edx:

        movl    $1, %eax
        test    $16, %edi
        cmove   %edi, %eax
        ret
I probably meant "testb $16, %dil" above...

GCC 4.3.5 avoids the usage of edx, although it too probably has 1 instruction too many:

        testb   $16, %dil
        movl    $1, %eax
        cmove   %eax, %edi
        movl    %edi, %eax
        ret
Changing bug title to include regression, as 4.3.5 is able to avoid the usage of edx.
Removing the regression tag. The gcc-4.5 and trunk code are in effect equivalent. The seemingly unnecessary use of %edx doesn't affect anything and is merely an artifact of it appearing earlier in the register file -- it has the same costs as other call-clobbered regs, and the allocator will use the first one appearing in the register file when multiple regs have the same cost. Note that it is true that the code could be improved. This just isn't a regression.

There are two paths to get the desired code. The first would involve some surgery to the register allocator. The key here is the pseudo holding "a". The cost analysis has %eax and %edi as having the same cost. Allocating into %eax allows the copy to the return value to be eliminated, while allocating into %edi allows the copy from the incoming argument to be deleted. What the allocator fails to realize is that the cmov can be used to eliminate the copy to the return value. Thus it's actually better to hold "a" in %edi. Fixing the allocator to realize this (and in fact use the cmov to eliminate the trailing mov) would probably be a PITA.

The other approach is to let combine merge the cmov and the move to the return register, even though the latter is a move to a likely-spilled hardreg. We generally avoid such combinations so as to avoid lengthening the lifetime of a likely-spilled hardreg, but in this kind of situation no lengthening would occur. A quick hack as proof of concept results in:

        testb   $16, %dil       # 36    *testqi_1_maybe_si/2    [length = 4]
        movl    $1, %eax        # 34    *movsi_internal/1       [length = 5]
        cmovne  %edi, %eax      # 19    *movsicc_noc/1          [length = 3]
        ret                     # 39    return_internal         [length = 1]
I'd hoped it would be possible to define a set of conditions under which combine could combine the conditional move with the subsequent move into a hard register. Without special casing the instruction patterns, reading the constraints and the like, I doubt that's going to be possible. The other approach I see is to have copyrename avoid coalescing the temporaries with "a". That in turn prevents "a" from being live throughout the entire function and we get the desired register allocations and code.
As a quick test, would this be fixed by re-ordering the register file to move eax above edx?

If so, then another possible fix would be to effectively re-run the RA pass multiple times, each time using a different register file, then select the one that produces the "best" code and discard the other RA attempts. The register files would only differ in their sort order when register costs are equal. I am guessing that only a few such register files would be needed (in particular ones where eax is shuffled around), rather than every single possible combination of sort orders (which would be prohibitive), so this doesn't necessarily have to impact the length of compilation by much. A metric would also be needed to select the "best" version of the compiled code - possibly the number of instructions and the number of registers used?
On 02/03/11 07:16, tony.poppleton at gmail dot com wrote:
> As a quick test, would this be fixed by re-ordering the register file to move
> eax above edx?

It's like flipping a coin. Sometimes it's heads, sometimes it's tails; tailoring the compiler around the fact that for this code we got tails isn't an approach we generally take. There is no sound reason to change the ordering of the registers.

> If so, then another possible fix to this would be to effectively re-run the RA
> pass multiple times, each time using a different register file, and then select
> the one that produces the "best" code and discard the other RA attempts.

I don't think we're likely to take this approach. As I mentioned earlier, using %edx is a complete and total red herring; ignore it.

The real issue here is that there's an unnecessary reg-reg copy. The unnecessary copy can be attributed to many factors (coalescing early, combine's inability to deal with hard regs well, reload's inability to handle hard regs well in certain circumstances, IRA not realizing the ranges of "a" can profitably be split, etc). Those are the avenues I would suggest someone wanting to fix this bug pursue.

jeff
This appears to be fixed with GCC 4.8.0 and flag -O2. The asm code produced is now exactly as Jeff said in comment #3:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  foo
        .type   foo, @function
foo:
.LFB0:
        .cfi_startproc
        testb   $16, %dil
        movl    $1, %eax
        cmovne  %edi, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
        .section        .note.GNU-stack,"",@progbits
Manually confirmed Tony's results in c#7. Could easily be due to the switch to LRA, which occurred in gcc-4.8.