This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug optimization/15242] New: pessimization of "goto *"
- From: "anton at mips dot complang dot tuwien dot ac dot at" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 1 May 2004 14:27:03 -0000
- Subject: [Bug optimization/15242] New: pessimization of "goto *"
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
This is essentially the 3.4.0 incarnation of the problem discussed at
the end of PR8092. mmitchel suggested to create a new bug report, so
here it is:
Code example and observed behaviour
-----------------------------------
In the attached code, the fragment between I_question_branch and
J_question_branch_lp_plus_store_number contains the following code
(slightly edited for readability):
I_question_branch:
{
Cell * a_target;
Bool f;
;
(( a_target )=(Cell *)( ( (* (ip) ) ) )) ;
(( f )=(Bool)( (sp[0]) )) ;
({ ip+=( 1 );}) ;
sp += 1;
{
if (f==0) {
({ip=( (Xt *)a_target ); ;}) ;
;
(ip++) ;
;
({asm("":"=X"(cfa)); goto **(ip-1);}) ;
}
;
}
;
(ip++) ;
;
K_question_branch :
({asm("":"=X"(cfa)); goto **(ip-1);}) ;
}
J_question_branch_lp_plus_store_number :
When the file is compiled with
gcc-3.4.0 -I../../gforth-0.6.2/engine/../arch/386 -I. -Wall -O2
-fomit-frame-pointer -fforce-addr -fforce-mem -march=pentium -DHAVE_CONFIG_H
-DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/share/gforth/site-forth:/usr/local/lib/gforth/0.6.2:/usr/local/share/gforth/0.6.2"'
-fno-gcse -fno-strict-aliasing -fno-defer-pop -fcaller-saves -fno-inline -S ef.i
-o ef-3.4.0-def.s
this produces the following code:
.L20:
movl (%edi), %eax
movl (%ebp), %edx
addl $4, %edi
addl $4, %ebp
testl %eax, %eax
jne .L1073
leal 4(%edx), %ebp
movl -4(%ebp), %eax
jmp .L1373
.L739:
... #actually the following fragment comes earlier
.L1073:
addl $4, %ebp
.L376:
movl -4(%ebp), %eax
jmp .L1373
... #the following fragment comes even earlier.
.L1373:
jmp *%eax
What is the problem?
--------------------
The main problem here is that the code jumps to a shared "jmp *eax"
instead of just putting "jmp *eax" (or, in this case, preferably "jmp
*-4(%ebp)") where the jump to the shared indirect jump is. All
indirect jumps are converted into jumps to a single shared indirect
jump in this way.
This obstructs effective usage of the BTB for branch prediction on
CPUs such as the Athlon, Opteron, and Pentium III (see below), and
most interpreters using GCC's labels-as-values extension will be
affected by this.
It also makes it impossible for us to use dynamic superinstructions
portably (aka selective inlining, PLDI'98 p. 291). I could eventually
find ways to work around the problems we have had with 3.2 and 3.3,
but this time I don't see a workaround (and a quick look for new
options revealed nothing).
A smaller problem is that the basic blocks are distributed throughout
the code unless we use -fno-reorder-blocks; this reduces the number of
cases where we can apply dynamic superinstructions. For gcc-3.3,
using -fno-crossjumping suppressed the main problem, but not in
combination with -fno-reorder-blocks. So if you fix the main problem,
it would be nice if it also worked with -fno-reorder-blocks.
How much does this cost in performance?
---------------------------------------
Here are the results of doing a
configure --enable-force-reg CC=...; make; make onebench
ENGINE_FAST="./gforth-fast --dynamic"; make onebench ENGINE_FAST="./gforth-fast
--no-dynamic"
for gforth-0.6.2 with various compilers (The ef.i file used for the
code example above was also created from gforth-0.6.2, with gcc-2.95
and without --enable-force-reg):
Pentium III 1000MHz; numbers are times in seconds user time
gcc-2.95.3 gcc-3.3 gcc-3.4.0
dynamic no-dyn dynamic no-dyn no-dyn
siev 0.56 0.85 0.67 0.90 2.31
bubble 0.72 1.28 0.93 1.35 2.68
matrix 0.35 1.43 0.35 1.38 1.90
fib 0.76 1.11 0.95 1.27 2.90
Here you see a factor 3.7-5.4 between the best result (gcc-2.95
dynamic) and the gcc-3.4.0 result for matrix. This slowdown has two
main components:
- we cannot apply the dynamic optimizations (and I don't see a
portable way to work around this problem), resulting in a slowdown
factors of 1.5-4.
- from the shared indirect jump; the jump to this indirect jump costs
something, but the more important component of the slowdown is
probably the reduction in branch prediction accuracy for the indirect
jumps; the resulting slowdown is about a factor of 1.4-2.5.
On the Athlon I have seen similar results in the past.
Here are results from a Pentium 4:
Pentium 4 2.26GHz; numbers are times in seconds user time
gcc-2.95.3 gcc-3.3 gcc-3.4.0
dynamic no-dyn dynamic no-dyn no-dyn
siev 0.24 0.48 0.31 0.47 0.50
bubble 0.30 0.78 0.36 0.77 0.78
matrix 0.19 0.94 0.17 0.92 0.96
fib 0.34 0.57 0.41 0.58 0.59
On the Pentium 4 the second component apparently does not play a big
role, probably due to the effects of the trace cache, but the first
component results in even higher slowdowns (1.7-5) than on the
PentiumIII.
Do flag variations change the problem?
--------------------------------------
Adding -fno-reorder-buffers moves .L1073 right below the first "jmp
.L1373" (as it should), but the jumps to .L1373 are still there.
Adding -fno-crossjumping in either case changes essentially nothing.
Reducing the command line to
gcc-3.4.0 -Wall -O2 -fomit-frame-pointer -S ef.i -o ef-3.4.0-nofs.s
produces similar results (but the register allocation is worse).
Suggested fix
-------------
AFAIK gcc introduces the shared indirect jump for the benefit of
data-flow analysis. In gcc-3.3 with some combinations of flags this
change was undone later (but the pass that did this was suppressed
with -fno-reorder-jumps). In 3.4.0 the undoing is apparently
completely disabled.
One fix would be to put the undoing into a pass that is invoked after
the data-flow analysis, but before the last combining pass, and that
is independent of flags such as -fno-reorder-blocks (maybe a separate
pass?). An alternative would be to produce the desired control-flow
graph for data-flow analysis without changing the code (i.e., without
introducing shared jumps).
I might give writing one of these approaches pass a try, but would
prefer to see it done by people who know more about gcc internals than
I do. In case I do it, I would apreciate feedback and suggestions on
how to proceed.
- anton
--
Summary: pessimization of "goto *"
Product: gcc
Version: 3.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: anton at mips dot complang dot tuwien dot ac dot at
CC: bernd dot paysan at gmx dot de,gcc-bugs at gcc dot gnu
dot org
GCC build triplet: i686-pc-linux-gnu
GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242