49057 – scheduling difference of subs cause 80% performance difference


Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.5.1

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2011-05-19 06:29 UTC by kun
Modified:	2017-07-27 01:42 UTC
CC List:	0 users

See Also:
Host:
Target:	arm--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2011-07-24 14:27:01


Description kun 2011-05-19 06:29:08 UTC

The following C code performs an “integer add” test. The variables n, i, i1, i2 and loop_cnt are all of type ‘int’. The initial values are loop_cnt=5000000, i=0, i1=3, i2=-3.
for (n = loop_cnt; n > 0; n--) {        /*    0    x     -x  - initial value */
                i += i1;                /*    x    x     -x   */
                i1 += i2;               /*    x    0     -x   */
                i1 += i2;               /*    x    -x    -x   */
                i2 += i;                /*    x    -x    0    */
                i2 += i;                /*    x    -x    x    */
                i += i1;                /*    0    -x    x    */
                i += i1;                /*    -x   -x    x    */
                i1 += i2;               /*    -x   0     x    */
                i1 += i2;               /*    -x   x     x    */
                i2 += i;                /*    -x   x     0    */
                i2 += i;                /*    -x   x     -x   */
                i += i1;                /*    0    x     -x   */
                /*
                 * Note that at loop end, i1 = -i2
                 */
                /*
                 * which is as we started.  Thus,
                 */
                /*
                 * the values in the loop are stable
                 */
        }
I compiled this C code with gcc-4.4.2 and gcc-4.5.1, and the two compilers generate different binary code.
Gcc-4.4.2:
284:	e0800003 	add	r0, r0, r3
 288:	e2511001 	subs	r1, r1, #1	; 0x1
 28c:	e0833082 	add	r3, r3, r2, lsl #1
 290:	e0822080 	add	r2, r2, r0, lsl #1
 294:	e0800083 	add	r0, r0, r3, lsl #1
 298:	e0833082 	add	r3, r3, r2, lsl #1
 29c:	e0822080 	add	r2, r2, r0, lsl #1
 2a0:	e0830000 	add	r0, r3, r0
 2a4:	1afffff6 	bne	284 <add_int+0x4c>

Gcc-4.5.1:
138:	e0800003 	add	r0, r0, r3
 13c:	e0833082 	add	r3, r3, r2, lsl #1
 140:	e0822080 	add	r2, r2, r0, lsl #1
 144:	e2511001 	subs	r1, r1, #1
 148:	e0800083 	add	r0, r0, r3, lsl #1
 14c:	e0833082 	add	r3, r3, r2, lsl #1
 150:	e0822080 	add	r2, r2, r0, lsl #1
 154:	e0830000 	add	r0, r3, r0
 158:	1afffff6 	bne	138 <add_int+0x4c>

As you can see, the only difference is the position of “subs r1, r1, #1”, yet it leads to a large difference in performance: the gcc-4.5.1 code reaches only 80% of the performance of the gcc-4.4.2 code.
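
No complete test case is attached to this report. The following is only a minimal sketch of what one could look like, not the reporter's actual code. The loop body and initial values are copied from the description, and the name add_int comes from the <add_int+0x4c> symbol in the disassembly. The function signature, the add_state struct, and the time_add_int helper are assumptions made for illustration. The CPU and the compile options used in the report are not stated, so none are assumed here.

```c
#include <time.h>

/* Loop state; field names follow the variables in the description. */
struct add_state {
    int i, i1, i2;
};

/* Signature is an assumption; only the loop body is from the report. */
struct add_state add_int(int loop_cnt, struct add_state s)
{
    int n;

    for (n = loop_cnt; n > 0; n--) {
        s.i  += s.i1;
        s.i1 += s.i2;
        s.i1 += s.i2;
        s.i2 += s.i;
        s.i2 += s.i;
        s.i  += s.i1;
        s.i  += s.i1;
        s.i1 += s.i2;
        s.i1 += s.i2;
        s.i2 += s.i;
        s.i2 += s.i;
        s.i  += s.i1;
    }
    return s;
}

/* Hypothetical helper: CPU seconds spent in one call to add_int,
 * started from the initial values given in the description. */
double time_add_int(int loop_cnt, struct add_state *out)
{
    struct add_state start = { 0, 3, -3 };
    clock_t t0 = clock();

    *out = add_int(loop_cnt, start);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_add_int(5000000, &result) from main and printing the returned time for each compiler would give the two numbers to compare. Returning the final state keeps the loop result live, so the compiler has less room to discard the work. The generated loop should still be checked in the disassembly, since a different compiler version may transform it differently.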

Comment 1 Richard Earnshaw 2011-07-24 14:27:01 UTC

You haven't said what CPU this is for, what options you used when compiling, and you haven't provided a complete testcase.

Are you *absolutely* sure this is the only difference? Because I find that hard to believe.  More likely is that the loop has a different alignment, or there is some other, secondary, issue that you've exposed.
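
One way to make the alignment question concrete is to count how many cache lines each loop touches. The sketch below does that arithmetic for the loop addresses shown in the description (0x284 to 0x2a4 and 0x138 to 0x158, each instruction 4 bytes). The helper name lines_spanned and the line sizes tried are assumptions, since the CPU is not stated. The addresses in the listing may also be object-file offsets rather than final load addresses, so this illustrates the check rather than diagnosing the problem.

```c
/* Hypothetical helper: number of cache lines of size line_size that
 * the byte range [first, last] touches. line_size must be non-zero. */
unsigned lines_spanned(unsigned first, unsigned last, unsigned line_size)
{
    return last / line_size - first / line_size + 1;
}

/* Last byte of a loop whose final 4-byte instruction starts at addr. */
unsigned last_byte(unsigned addr)
{
    return addr + 3;
}
```

With an assumed 64-byte line, lines_spanned(0x284, last_byte(0x2a4), 64) gives 1 for the gcc-4.4.2 loop, while lines_spanned(0x138, last_byte(0x158), 64) gives 2 for the gcc-4.5.1 loop. With an assumed 32-byte line, both loops span 2 lines. Whether this matters depends on the actual CPU and the final link addresses.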

Comment 2 Eric Gallager 2017-07-27 01:42:45 UTC

(In reply to Richard Earnshaw from comment #1)
> You haven't said what CPU this is for, what options you used when compiling,
> and you haven't provided a complete testcase.
> 
> Are you *absolutely* sure this is the only difference? because I find that
> hard to believe.  More likely is that the loop has a different alignment, or
> there is some other, secondary, issue that you've exposed.

Reporter never replied; closing