Bug 55623 - [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
Summary: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.7.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-12-09 10:00 UTC by Siarhei Siamashka
Modified: 2015-08-14 12:22 UTC
CC List: 3 users

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-12-09 00:00:00


Attachments
badsched.c (357 bytes, text/plain)
2012-12-09 10:00 UTC, Siarhei Siamashka
badschedmul.c (357 bytes, text/plain)
2012-12-09 11:21 UTC, Siarhei Siamashka

Description Siarhei Siamashka 2012-12-09 10:00:09 UTC
This is a missed optimization. Or rather, in this particular case, GCC is undoing a programmer's attempt to optimize the code for superscalar dual-issue processors.

$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -o badsched badsched.c
$ objdump -d badsched

00000000 <f1>:
   0:	e1a03120 	lsr	r3, r0, #2
   4:	e08330a0 	add	r3, r3, r0, lsr #1
   8:	e08331a0 	add	r3, r3, r0, lsr #3
   c:	e0833220 	add	r3, r3, r0, lsr #4
  10:	e08332a0 	add	r3, r3, r0, lsr #5
  14:	e0833320 	add	r3, r3, r0, lsr #6
  18:	e08333a0 	add	r3, r3, r0, lsr #7
  1c:	e0833420 	add	r3, r3, r0, lsr #8
  20:	e08334a0 	add	r3, r3, r0, lsr #9
  24:	e0833520 	add	r3, r3, r0, lsr #10
  28:	e08335a0 	add	r3, r3, r0, lsr #11
  2c:	e0833620 	add	r3, r3, r0, lsr #12
  30:	e08336a0 	add	r3, r3, r0, lsr #13
  34:	e0833720 	add	r3, r3, r0, lsr #14
  38:	e08337a0 	add	r3, r3, r0, lsr #15
  3c:	e0833820 	add	r3, r3, r0, lsr #16
  40:	e08338a0 	add	r3, r3, r0, lsr #17
  44:	e0833920 	add	r3, r3, r0, lsr #18
  48:	e08339a0 	add	r3, r3, r0, lsr #19
  4c:	e0833a20 	add	r3, r3, r0, lsr #20
  50:	e0833aa0 	add	r3, r3, r0, lsr #21
  54:	e0833b20 	add	r3, r3, r0, lsr #22
  58:	e0833ba0 	add	r3, r3, r0, lsr #23
  5c:	e0830c20 	add	r0, r3, r0, lsr #24
  60:	e12fff1e 	bx	lr

00000064 <f2>:
  64:	e1a031a0 	lsr	r3, r0, #3
  68:	e1a02220 	lsr	r2, r0, #4
  6c:	e08330a0 	add	r3, r3, r0, lsr #1
  70:	e0822120 	add	r2, r2, r0, lsr #2
  74:	e08332a0 	add	r3, r3, r0, lsr #5
  78:	e0822320 	add	r2, r2, r0, lsr #6
  7c:	e08333a0 	add	r3, r3, r0, lsr #7
  80:	e0822420 	add	r2, r2, r0, lsr #8
  84:	e08334a0 	add	r3, r3, r0, lsr #9
  88:	e0822520 	add	r2, r2, r0, lsr #10
  8c:	e08335a0 	add	r3, r3, r0, lsr #11
  90:	e0822620 	add	r2, r2, r0, lsr #12
  94:	e08336a0 	add	r3, r3, r0, lsr #13
  98:	e0822720 	add	r2, r2, r0, lsr #14
  9c:	e08337a0 	add	r3, r3, r0, lsr #15
  a0:	e0822820 	add	r2, r2, r0, lsr #16
  a4:	e08338a0 	add	r3, r3, r0, lsr #17
  a8:	e0822920 	add	r2, r2, r0, lsr #18
  ac:	e08339a0 	add	r3, r3, r0, lsr #19
  b0:	e0822a20 	add	r2, r2, r0, lsr #20
  b4:	e0833aa0 	add	r3, r3, r0, lsr #21
  b8:	e0822b20 	add	r2, r2, r0, lsr #22
  bc:	e0833ba0 	add	r3, r3, r0, lsr #23
  c0:	e0820c20 	add	r0, r2, r0, lsr #24
  c4:	e0800003 	add	r0, r0, r3
  c8:	e12fff1e 	bx	lr

Guess which one of these two functions will be faster?

=== Cortex-A8 @1000MHz ===

$ time ./badsched 1

real	0m2.512s
user	0m2.500s
sys	0m0.000s

$ time ./badsched 2

real	0m2.064s
user	0m2.008s
sys	0m0.008s

=== Cortex-A15 @1700MHz ===

$ time ./badsched 1

real	0m2.786s
user	0m2.770s
sys	0m0.005s

$ time ./badsched 2

real	0m1.451s
user	0m1.440s
sys	0m0.005s

There is function call and loop overhead which prevents the Cortex-A8 from showing ~2x better performance with the "f2" function. We can try marking these functions as static so that they get inlined, but then the asm workaround becomes ineffective in a rather interesting way, which also demonstrates instruction scheduling issues.
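
Since the attachment contents are not inlined in this report, here is a minimal sketch of what badsched.c presumably contains: the f1/f2 bodies are reconstructed from the disassembly above, while the driver and the iteration count are assumptions.

#include <stdlib.h>

unsigned int f1(unsigned int x)
{
    unsigned int a = x >> 1, b = x >> 2;
    a += x >> 3;  b += x >> 4;
    a += x >> 5;  b += x >> 6;
    a += x >> 7;  b += x >> 8;
    a += x >> 9;  b += x >> 10;
    a += x >> 11; b += x >> 12;
    a += x >> 13; b += x >> 14;
    a += x >> 15; b += x >> 16;
    a += x >> 17; b += x >> 18;
    a += x >> 19; b += x >> 20;
    a += x >> 21; b += x >> 22;
    a += x >> 23; b += x >> 24;
    return a + b;            /* GCC reassociates the two chains into one */
}

unsigned int f2(unsigned int x)
{
    unsigned int a = x >> 1, b = x >> 2;
    a += x >> 3;  b += x >> 4;
    a += x >> 5;  b += x >> 6;
    a += x >> 7;  b += x >> 8;
    a += x >> 9;  b += x >> 10;
    a += x >> 11; b += x >> 12;
    a += x >> 13; b += x >> 14;
    a += x >> 15; b += x >> 16;
    a += x >> 17; b += x >> 18;
    a += x >> 19; b += x >> 20;
    a += x >> 21; b += x >> 22;
    a += x >> 23; b += x >> 24;
    asm ("" : "+r" (a));     /* empty asm keeps the two chains separate */
    return a + b;
}

int main(int argc, char *argv[])
{
    /* Pick the function from argv[1], as in the transcripts above.  */
    unsigned int (*f)(unsigned int) = (argc > 1 && atoi(argv[1]) == 2) ? f2 : f1;
    unsigned int i, r = 0;
    for (i = 0; i < 100000000u; i++)  /* iteration count is a guess */
        r += f(i);
    return r == 0;           /* use the result so the loop is kept */
}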
Comment 1 Siarhei Siamashka 2012-12-09 10:00:59 UTC
Created attachment 28904 [details]
badsched.c
Comment 2 Andrew Pinski 2012-12-09 10:48:06 UTC
This is an ARM (both arm32 and arm64) specific issue due to the shifts being "free".  If you look at the mips assembly, it looks good for a dual-issue processor as it is scheduled as an add followed by a shift.

I think the issue is that reassoc does not know that shifts are free on arm.
Comment 3 Siarhei Siamashka 2012-12-09 11:18:56 UTC
(In reply to comment #2)
> This is an ARM (both arm32 and arm64) specific issue due to the shifts being
> "free".  If you look at the mips assembly, it looks good for a dual-issue
> processor as it is scheduled as an add followed by a shift.
> 
> I think the issue is that reassoc does not know that shifts are free on arm.

This does not look like an ARM-only issue. To demonstrate it on MIPS, even without dual issue, all the additions can simply be replaced with multiplications (because multiplication is a long-latency instruction). In this case we get:

unsigned int f1(unsigned int x)
{
    unsigned int a, b;
    a = x >> 1;
    b = x >> 2;
    a *= x >> 3;
    b *= x >> 4;
    a *= x >> 5;
    b *= x >> 6;
    a *= x >> 7;
    b *= x >> 8;
    a *= x >> 9;
    b *= x >> 10;
    a *= x >> 11;
    b *= x >> 12;
    a *= x >> 13;
    b *= x >> 14;
    a *= x >> 15;
    b *= x >> 16;
    a *= x >> 17;
    b *= x >> 18;
    a *= x >> 19;
    b *= x >> 20;
    a *= x >> 21;
    b *= x >> 22;
    a *= x >> 23;
    b *= x >> 24;
    return a * b;
}

unsigned int f2(unsigned int x)
{
    unsigned int a, b;
    a = x >> 1;
    b = x >> 2;
    a *= x >> 3;
    b *= x >> 4;
    a *= x >> 5;
    b *= x >> 6;
    a *= x >> 7;
    b *= x >> 8;
    a *= x >> 9;
    b *= x >> 10;
    a *= x >> 11;
    b *= x >> 12;
    a *= x >> 13;
    b *= x >> 14;
    a *= x >> 15;
    b *= x >> 16;
    a *= x >> 17;
    b *= x >> 18;
    a *= x >> 19;
    b *= x >> 20;
    a *= x >> 21;
    b *= x >> 22;
    a *= x >> 23;
    b *= x >> 24;
    asm ("" : "+r" (a));
    return a * b;
}

And the benchmark run on MIPS 74K:

$ gcc -O2 -march=mips32r2 -mtune=74kc -o badschedmul badschedmul.c
$ time ./badschedmul 1

real	0m34.934s
user	0m34.689s
sys	0m0.073s

$ time ./badschedmul 2

real	0m19.261s
user	0m19.122s
sys	0m0.050s

The symptoms are still the same: GCC merges two independent calculations into a single dependency chain, while I would have expected it to do the opposite (break dependency chains so that the code runs faster on the target CPU).
Comment 4 Siarhei Siamashka 2012-12-09 11:21:42 UTC
Created attachment 28905 [details]
badschedmul.c

The testcase, converted to use multiplications. It can be used to demonstrate the problem on all architectures, including x86-64.
Comment 5 Steven Bosscher 2012-12-09 11:36:39 UTC
Not target-dependent; it also shows up on powerpc64.
Comment 6 Steven Bosscher 2012-12-09 12:13:03 UTC
Can you try using -frename-registers?
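
For example, added to the compile command from the description (the exact invocation is an assumption):

$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -frename-registers -o badsched badsched.c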

Without vs with -frename-registers:
mov r3, r0, lsr #2           mov r3, r0, lsr #2
add r3, r3, r0, lsr #1     | add r1, r3, r0, lsr #1
add r3, r3, r0, lsr #3     | add r2, r1, r0, lsr #3
add r3, r3, r0, lsr #4     | add ip, r2, r0, lsr #4
add r3, r3, r0, lsr #5     | add r3, ip, r0, lsr #5
add r3, r3, r0, lsr #6     | add r1, r3, r0, lsr #6
add r3, r3, r0, lsr #7     | add r2, r1, r0, lsr #7
add r3, r3, r0, lsr #8     | add ip, r2, r0, lsr #8
add r3, r3, r0, lsr #9     | add r3, ip, r0, lsr #9
add r3, r3, r0, lsr #10    | add r1, r3, r0, lsr #10 
add r3, r3, r0, lsr #11    | add r2, r1, r0, lsr #11 
add r3, r3, r0, lsr #12    | add ip, r2, r0, lsr #12 
add r3, r3, r0, lsr #13    | add r3, ip, r0, lsr #13 
add r3, r3, r0, lsr #14    | add r1, r3, r0, lsr #14 
add r3, r3, r0, lsr #15    | add r2, r1, r0, lsr #15 
add r3, r3, r0, lsr #16    | add ip, r2, r0, lsr #16 
add r3, r3, r0, lsr #17    | add r3, ip, r0, lsr #17 
add r3, r3, r0, lsr #18    | add r1, r3, r0, lsr #18 
add r3, r3, r0, lsr #19    | add r2, r1, r0, lsr #19 
add r3, r3, r0, lsr #20    | add ip, r2, r0, lsr #20 
add r3, r3, r0, lsr #21    | add r3, ip, r0, lsr #21 
add r3, r3, r0, lsr #22    | add r1, r3, r0, lsr #22 
add r3, r3, r0, lsr #23    | add r2, r1, r0, lsr #23 
add r0, r3, r0, lsr #24    | add r0, r2, r0, lsr #24 
bx  lr                       bx  lr
Comment 7 Steven Bosscher 2012-12-09 12:14:50 UTC
(In reply to comment #6)
> Can you try using -frename-registers?

(Not that I expect it to help because the dependencies are not really
broken, but maybe the in-core reg-renamer likes this form better...)
Comment 8 Richard Biener 2012-12-10 09:47:42 UTC
You can also adjust --param tree-reassoc-width or have the target implement
the sched.reassociation_width target hook (for the default).
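
For reference, the param can be tried directly on the original test case, e.g.:

$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 --param tree-reassoc-width=2 -o badsched badsched.c

And a minimal sketch of what a target hook implementation might look like (illustrative only, not actual arm backend code; the function name is made up, only the TARGET_SCHED_REASSOCIATION_WIDTH hook itself is real):

/* Allow the reassociation pass to build two parallel chains for
   integer operations, one per issue pipe on a dual-issue core.
   Illustrative sketch, not actual arm backend code.  */
static int
arm_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
                         enum machine_mode mode)
{
  if (GET_MODE_CLASS (mode) == MODE_INT)
    return 2;
  return 1;
}

#undef TARGET_SCHED_REASSOCIATION_WIDTH
#define TARGET_SCHED_REASSOCIATION_WIDTH arm_reassociation_width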
Comment 9 Ramana Radhakrishnan 2012-12-11 16:33:25 UTC
(In reply to comment #8)
> You can also adjust --param tree-reassoc-width or have the target implement
> the sched.reassociation_width target hook (for the default).

I did try it on an A9 when the hook came out and it made bugger all difference to the benchmarks I cared about on a range of values. 


It's possible this might help on some of the newer cores. 

ramana