Bug 11488 - Pre-regalloc scheduling severely worsens performance
Summary: Pre-regalloc scheduling severely worsens performance
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 3.4.0
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization, ra
Depends on:
Blocks: 79149
Reported: 2003-07-10 14:05 UTC by Falk Hueffner
Modified: 2017-01-20 20:06 UTC

See Also:
Target: alphaev68-*-*
Known to work:
Known to fail:
Last reconfirmed: 2005-12-24 20:44:01

Test case (1.09 KB, text/plain)
2003-07-10 15:18 UTC, Falk Hueffner

Description Falk Hueffner 2003-07-10 14:05:19 UTC
This is a known problem, but there doesn't seem to be any good test case in the
database, and I think it's one of our worst performance problems currently.
Here's a nice test case:

gcc version 3.4 20030705 (experimental)

% gcc -O3 idct3.c && time ./a.out                    
./a.out  3.30s user 0.04s system 99% cpu 3.347 total    

% gcc -fno-schedule-insns -O3 idct3.c && time ./a.out
./a.out  1.19s user 0.00s system 101% cpu 1.171 total                       

i.e., a slowdown by a factor of roughly 2.8.

The new register allocator doesn't help a lot:

% gcc -O3 -fnew-ra idct3.c && time ./a.out
./a.out  3.09s user 0.04s system 100% cpu 3.128 total            

% gcc -fno-schedule-insns -fnew-ra -O3 idct3.c && time ./a.out
./a.out  1.24s user 0.00s system 97% cpu 1.276 total

The problem is that scheduling introduces false dependencies, which leads to
excessive spilling.
Comment 1 Falk Hueffner 2003-07-10 15:18:55 UTC
Created attachment 4378
Test case
Comment 2 Andrew Pinski 2003-07-11 05:28:40 UTC
I can confirm this behavior on PowerPC.
In fact it is even worse on PowerPC (TiBook G4 with Mac OS X 10.2.6):
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c 
[omni:~/src/gccPRs] pinskia% time ./a.out
6.600u 0.010s 0:06.83 96.7%     0+0k 0+1io 0pf+0w
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fno-schedule-insns
[omni:~/src/gccPRs] pinskia% time ./a.out
1.830u 0.000s 0:01.87 97.8%     0+0k 0+0io 0pf+0w
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fnew-ra
[omni:~/src/gccPRs] pinskia% time ./a.out
6.120u 0.020s 0:06.57 93.4%     0+0k 0+0io 0pf+0w
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fno-schedule-insns -fnew-ra
[omni:~/src/gccPRs] pinskia% time ./a.out
1.760u 0.000s 0:01.81 97.2%     0+0k 0+0io 0pf+0w

The code that gcc produces without -fno-schedule-insns is slower by a factor of 3.6.
Comment 3 Nathanael C. Nerode 2003-07-11 05:53:03 UTC
>The problem is that scheduling introduces false dependencies, which leads to
>excessive spilling.
This may be dumb, but why isn't scheduling run *after* the register allocator?  It seems altogether more logical to me, given the nature of the 
passes.  (Oh, right, that's an option, isn't it!  Do the numbers you're showing with -fno-schedule-insns reflect -fschedule-insns2 or 
-fno-schedule-insns2?  With post-regalloc scheduling, why is scheduling ever run before the register allocator?)

Failing that, I suppose we would have to preserve the pre-scheduling dependence information and feed it into the register allocator, which sounds 
quite difficult.

(Incidentally, I presume these problems are against the newest scheduler, the DFA one.  Target maintainers correct me if I'm confused.)
Comment 4 Andrew Pinski 2003-07-11 05:59:43 UTC
The older non-DFA scheduler had the same problem, at least on PowerPC.
Here are the numbers with scheduling turned off:
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fno-schedule-insns -fnew-ra -fno-
[omni:~/src/gccPRs] pinskia% time ./a.out
1.920u 0.020s 0:02.45 79.1%     0+0k 7+1io 0pf+0w
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fno-schedule-insns -fno-schedule-insns2
[omni:~/src/gccPRs] pinskia% time ./a.out
2.070u 0.030s 0:02.41 87.1%     0+0k 0+0io 0pf+0w

Here are the numbers with only the post-regalloc scheduling turned off:
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c  -fno-schedule-insns2
[omni:~/src/gccPRs] pinskia% time ./a.out
9.820u 0.050s 0:11.10 88.9%     0+0k 0+1io 0pf+0w
[omni:~/src/gccPRs] pinskia% gcc -O3 idct3.c -fno-schedule-insns2 -fnew-ra
[omni:~/src/gccPRs] pinskia% time ./a.out
7.130u 0.030s 0:08.03 89.1%     0+0k 0+1io 0pf+0

You can see that scheduling before the register allocator causes many problems, and that
the second scheduler fixes some of them but cannot fix all of them.
Comment 5 Nathanael C. Nerode 2003-07-11 06:20:45 UTC
In that case, the fix looks bloody obvious.  Best performance is with post-regalloc scheduling pass on and pre-regalloc scheduling pass off, so we 
should make that the default configuration.  I don't see any downside.  Why doesn't one of you propose that to the GCC mailing list and see if 
someone knows of a downside I don't?  (I should really get back to my configury work...)
Comment 6 Andrew Pinski 2003-07-15 14:54:45 UTC
Adding powerpc-apple-darwin6.6 as a target, since that is where I confirmed it.
Comment 7 Falk Hueffner 2004-05-27 11:44:12 UTC
See also PR 15431 for a similar problem.
Comment 8 Andrew Pinski 2004-11-06 17:50:17 UTC
New timings from the mainline for PPC:
[zhivago:~/src/localgccPRs] pinskia% ~/local/bin/gcc -O2 pr11488.c -fno-schedule-insns
[zhivago:~/src/localgccPRs] pinskia% time ./a.out
1.300u 0.010s 0:01.34 97.7%     0+0k 0+0io 0pf+0w
[zhivago:~/src/localgccPRs] pinskia% ~/local/bin/gcc -O2 pr11488.c
[zhivago:~/src/localgccPRs] pinskia% time ./a.out
4.060u 0.020s 0:04.16 98.0%     0+0k 0+0io 0pf+0w
Comment 9 Andrew Pinski 2008-09-14 03:10:21 UTC
IRA improves this, but -O2 is still worse than both -O1 and -O2 -fno-schedule-insns (which is the best so far).
Comment 10 Steven Bosscher 2011-05-22 15:02:03 UTC
Someone should try with -fsched-pressure...
Comment 11 Matt Turner 2016-05-15 00:12:52 UTC
(In reply to Steven Bosscher from comment #10)
> Someone should try with -fsched-pressure...

On alpha with gcc-5.3.0:

% gcc -O2 -mbwx idct3.c && time ./a.out 
./a.out  1.72s user 0.00s system 99% cpu 1.721 total

% gcc -O2 -fno-schedule-insns -mbwx idct3.c && time ./a.out 
./a.out  0.96s user 0.01s system 99% cpu 0.970 total

% gcc -O2 -fsched-pressure -mbwx idct3.c && time ./a.out 
./a.out  1.01s user 0.00s system 99% cpu 1.016 total

(-mbwx is needed; otherwise neither -fsched-pressure nor -fno-schedule-insns shows any benefit.)

So it looks like -fsched-pressure helps significantly, but not quite as much as -fno-schedule-insns.
Comment 12 Pat Haugen 2016-12-21 19:16:04 UTC
Author: pthaugen
Date: Wed Dec 21 19:15:32 2016
New Revision: 243866

URL: https://gcc.gnu.org/viewcvs?rev=243866&root=gcc&view=rev
	PR rtl-optimization/11488
	* common/config/rs6000/rs6000-common.c
	(rs6000_option_optimization_table): Enable -fsched-pressure.
	* config/rs6000/rs6000.c (TARGET_COMPUTE_PRESSURE_CLASSES): Define
	target hook.
	(rs6000_option_override_internal): Set default -fsched-pressure algorithm.
	(rs6000_compute_pressure_classes): Implement target hook.

Comment 13 Pat Haugen 2016-12-21 20:03:01 UTC
Fixed on powerpc. Testcase times (at 8X original loop count to get measurable times).

base: 4.436 sec
base + -fno-schedule-insns: 2.052 sec
base + patch: 1.815 sec
Comment 14 Pat Haugen 2017-01-18 15:55:47 UTC
(In reply to Pat Haugen from comment #13)
> Fixed on powerpc.

So removing it from target list.