Bug 49057 - scheduling difference of subs cause 80% performance difference
Summary: scheduling difference of subs cause 80% performance difference
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.5.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2011-05-19 06:29 UTC by kun
Modified: 2017-07-27 01:42 UTC (History)
0 users

See Also:
Host:
Target: arm-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-07-24 14:27:01


Description kun 2011-05-19 06:29:08 UTC
The following C code performs an "integer add" test. The variables n, i, i1, i2 and loop_cnt are all of type 'int'. The initial values are: loop_cnt=5000000, i=0, i1=3, i2=-3.
for (n = loop_cnt; n > 0; n--) {
        /* columns:           i    i1    i2                    */
        /* initial values:    0    x     -x                    */
        i += i1;        /*    x    x     -x   */
        i1 += i2;       /*    x    0     -x   */
        i1 += i2;       /*    x    -x    -x   */
        i2 += i;        /*    x    -x    0    */
        i2 += i;        /*    x    -x    x    */
        i += i1;        /*    0    -x    x    */
        i += i1;        /*    -x   -x    x    */
        i1 += i2;       /*    -x   0     x    */
        i1 += i2;       /*    -x   x     x    */
        i2 += i;        /*    -x   x     0    */
        i2 += i;        /*    -x   x     -x   */
        i += i1;        /*    0    x     -x   */
        /*
         * Note that at loop end, i1 = -i2, which is as we started.
         * Thus, the values in the loop are stable.
         */
}
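For reference, the loop can be wrapped into a standalone function along the following lines. This is only a sketch of what a complete testcase might look like, not the reporter's actual source: the name add_int is taken from the symbol in the disassembly, while struct add_state, the function signature, the noinline attribute and the helper add_int_is_stable are assumptions made for illustration.

```c
/* Hypothetical container for the three accumulators; not from the report. */
struct add_state {
        int i, i1, i2;
};

/*
 * Runs the "integer add" loop loop_cnt times and returns the final values.
 * noinline is a GCC attribute, assumed here so that the loop stays in its
 * own function instead of being merged into the caller.
 */
__attribute__((noinline))
struct add_state add_int(int loop_cnt, struct add_state s)
{
        int n;
        int i = s.i, i1 = s.i1, i2 = s.i2;

        for (n = loop_cnt; n > 0; n--) {
                i += i1;
                i1 += i2;
                i1 += i2;
                i2 += i;
                i2 += i;
                i += i1;
                i += i1;
                i1 += i2;
                i1 += i2;
                i2 += i;
                i2 += i;
                i += i1;
        }

        s.i = i;
        s.i1 = i1;
        s.i2 = i2;
        return s;
}

/*
 * Runs the configuration given in the report (loop_cnt=5000000, i=0, i1=3,
 * i2=-3) and returns 1 if the values end up where they started.
 */
int add_int_is_stable(void)
{
        struct add_state start = { 0, 3, -3 };
        struct add_state end = add_int(5000000, start);

        return end.i == start.i && end.i1 == start.i1 && end.i2 == start.i2;
}
```

Calling add_int_is_stable() from main and timing it under both compilers with identical options would give a comparison that others can reproduce. The CPU and the compiler flags would still need to be stated alongside it.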
I compiled this C code with gcc-4.4.2 and gcc-4.5.1, and the two compilers generate different binary code.
Gcc-4.4.2:
 284:	e0800003 	add	r0, r0, r3
 288:	e2511001 	subs	r1, r1, #1	; 0x1
 28c:	e0833082 	add	r3, r3, r2, lsl #1
 290:	e0822080 	add	r2, r2, r0, lsl #1
 294:	e0800083 	add	r0, r0, r3, lsl #1
 298:	e0833082 	add	r3, r3, r2, lsl #1
 29c:	e0822080 	add	r2, r2, r0, lsl #1
 2a0:	e0830000 	add	r0, r3, r0
 2a4:	1afffff6 	bne	284 <add_int+0x4c>

Gcc-4.5.1:
 138:	e0800003 	add	r0, r0, r3
 13c:	e0833082 	add	r3, r3, r2, lsl #1
 140:	e0822080 	add	r2, r2, r0, lsl #1
 144:	e2511001 	subs	r1, r1, #1
 148:	e0800083 	add	r0, r0, r3, lsl #1
 14c:	e0833082 	add	r3, r3, r2, lsl #1
 150:	e0822080 	add	r2, r2, r0, lsl #1
 154:	e0830000 	add	r0, r3, r0
 158:	1afffff6 	bne	138 <add_int+0x4c>

As you can see, the only difference is the position of "subs r1, r1, #1", yet this difference leads to a large difference in performance: the latter achieves only 80% of the performance of the former.
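The claim about the loop body can be checked mechanically against the instruction words in the two listings. The sketch below is an illustration only: the array and function names are hypothetical, and it covers just the nine words shown in each listing, so it says nothing about the code outside the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define LOOP_LEN  9
#define SUBS_WORD 0xe2511001u   /* encoding of: subs r1, r1, #1 */

/* Instruction words copied from the gcc-4.4.2 listing (0x284..0x2a4). */
const uint32_t loop_442[LOOP_LEN] = {
        0xe0800003u, 0xe2511001u, 0xe0833082u, 0xe0822080u, 0xe0800083u,
        0xe0833082u, 0xe0822080u, 0xe0830000u, 0x1afffff6u
};

/* Instruction words copied from the gcc-4.5.1 listing (0x138..0x158). */
const uint32_t loop_451[LOOP_LEN] = {
        0xe0800003u, 0xe0833082u, 0xe0822080u, 0xe2511001u, 0xe0800083u,
        0xe0833082u, 0xe0822080u, 0xe0830000u, 0x1afffff6u
};

/* Returns the index of the first occurrence of word, or -1 if absent. */
int find_word(const uint32_t *loop, size_t len, uint32_t word)
{
        size_t k;

        for (k = 0; k < len; k++)
                if (loop[k] == word)
                        return (int)k;
        return -1;
}

/* Returns 1 if a and b are identical once every `skip` word is ignored. */
int same_except(const uint32_t *a, const uint32_t *b, size_t len,
                uint32_t skip)
{
        size_t ia = 0, ib = 0;

        while (ia < len || ib < len) {
                if (ia < len && a[ia] == skip) { ia++; continue; }
                if (ib < len && b[ib] == skip) { ib++; continue; }
                if (ia >= len || ib >= len || a[ia] != b[ib])
                        return 0;
                ia++;
                ib++;
        }
        return 1;
}
```

With these arrays, find_word locates the subs at index 1 in the 4.4.2 loop and at index 3 in the 4.5.1 loop, and same_except reports the remaining eight words as identical. The branch word is the same in both because the branch offset is PC-relative.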
Comment 1 Richard Earnshaw 2011-07-24 14:27:01 UTC
You haven't said what CPU this is for, what options you used when compiling, and you haven't provided a complete testcase.

Are you *absolutely* sure this is the only difference? because I find that hard to believe.  More likely is that the loop has a different alignment, or there is some other, secondary, issue that you've exposed.
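The addresses in the two listings do show that the loops start at different alignments (0x284 versus 0x138). The following sketch computes where each loop falls relative to aligned blocks. The helper names are hypothetical, and which block size matters, if any, depends on the CPU, which the report does not state; 8, 32 and 64 bytes are used here only as example sizes.

```c
#include <stdint.h>

#define LOOP_BYTES 36u  /* 9 instructions x 4 bytes each, per the listings */

/*
 * Offset of addr within an aligned block of `block` bytes.
 * block must be a power of two.
 */
uint32_t offset_in_block(uint32_t addr, uint32_t block)
{
        return addr & (block - 1u);
}

/* Number of aligned `block`-byte blocks touched by [start, start + size). */
uint32_t blocks_spanned(uint32_t start, uint32_t size, uint32_t block)
{
        uint32_t first = start / block;
        uint32_t last = (start + size - 1u) / block;

        return last - first + 1u;
}
```

With these helpers, the 4.4.2 loop at 0x284 starts 4 bytes into an 8-byte block and fits within a single 64-byte block, whereas the 4.5.1 loop at 0x138 starts on an 8-byte boundary but straddles two 64-byte blocks. Whether either property explains the slowdown cannot be determined from the information in this report.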
Comment 2 Eric Gallager 2017-07-27 01:42:45 UTC
(In reply to Richard Earnshaw from comment #1)
> You haven't said what CPU this is for, what options you used when compiling,
> and you haven't provided a complete testcase.
> 
> Are you *absolutely* sure this is the only difference? because I find that
> hard to believe.  More likely is that the loop has a different alignment, or
> there is some other, secondary, issue that you've exposed.

Reporter never replied; closing