[Bug optimization/15242] New: pessimization of "goto *"

Sat May 1 14:27:00 GMT 2004

This is essentially the 3.4.0 incarnation of the problem discussed at
the end of PR8092.  mmitchel suggested to create a new bug report, so
here it is:

Code example and observed behaviour
-----------------------------------

In the attached code, the fragment between I_question_branch and
J_question_branch_lp_plus_store_number contains the following code
(slightly edited for readability):

I_question_branch:   
{
  Cell * a_target;
  Bool f;
  ;
  (( a_target )=(Cell *)( ( (* (ip) )   )  )) ;
  (( f )=(Bool)( (sp[0])  )) ;
  ({ ip+=( 1 );}) ;
  sp += 1;
  {
    if (f==0) {
      ({ip=( (Xt *)a_target );  ;}) ;
      ;
      (ip++) ;
      ;
      ({asm("":"=X"(cfa));  goto **(ip-1);}) ;
    }
    ;
  }
  ;
  (ip++) ;
  ;
K_question_branch : 
  ({asm("":"=X"(cfa));  goto **(ip-1);}) ;
}
J_question_branch_lp_plus_store_number :

When the file is compiled with

gcc-3.4.0  -I../../gforth-0.6.2/engine/../arch/386 -I. -Wall -O2
-fomit-frame-pointer -fforce-addr -fforce-mem -march=pentium -DHAVE_CONFIG_H
-DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/share/gforth/site-forth:/usr/local/lib/gforth/0.6.2:/usr/local/share/gforth/0.6.2"'
-fno-gcse -fno-strict-aliasing -fno-defer-pop -fcaller-saves -fno-inline -S ef.i
-o ef-3.4.0-def.s

this produces the following code:

.L20:
	movl	(%edi), %eax
	movl	(%ebp), %edx
	addl	$4, %edi
	addl	$4, %ebp
	testl	%eax, %eax
	jne	.L1073
	leal	4(%edx), %ebp
	movl	-4(%ebp), %eax
	jmp	.L1373
.L739:
... #actually the following fragment comes earlier
.L1073:
	addl	$4, %ebp
.L376:
	movl	-4(%ebp), %eax
	jmp	.L1373
... #the following fragment comes even earlier.
.L1373: 
	jmp	*%eax

What is the problem?
--------------------

The main problem here is that the code jumps to a shared "jmp *eax"
instead of just putting "jmp *eax" (or, in this case, preferably "jmp
*-4(%ebp)") where the jump to the shared indirect jump is.  All
indirect jumps are converted into jumps to a single shared indirect
jump in this way.

This obstructs effective usage of the BTB for branch prediction on
CPUs such as the Athlon, Opteron, and Pentium III (see below), and
most interpreters using GCC's labels-as-values extension will be
affected by this.

It also makes it impossible for us to use dynamic superinstructions
portably (aka selective inlining, PLDI'98 p. 291).  I could eventually
find ways to work around the problems we have had with 3.2 and 3.3,
but this time I don't see a workaround (and a quick look for new
options revealed nothing).

A smaller problem is that the basic blocks are distributed throughout
the code unless we use -fno-reorder-blocks; this reduces the number of
cases where we can apply dynamic superinstructions.  For gcc-3.3,
using -fno-crossjumping suppressed the main problem, but not in
combination with -fno-reorder-blocks.  So if you fix the main problem,
it would be nice if it also worked with -fno-reorder-blocks.

How much does this cost in performance?
---------------------------------------

Here are the results of doing a

configure --enable-force-reg CC=...; make; make onebench
ENGINE_FAST="./gforth-fast --dynamic"; make onebench ENGINE_FAST="./gforth-fast
--no-dynamic"

for gforth-0.6.2 with various compilers (The ef.i file used for the
code example above was also created from gforth-0.6.2, with gcc-2.95
and without --enable-force-reg):

Pentium III 1000MHz; numbers are times in seconds user time
	   gcc-2.95.3	   gcc-3.3	gcc-3.4.0
	dynamic no-dyn	dynamic	no-dyn	no-dyn
siev    0.56    0.85    0.67    0.90    2.31
bubble  0.72	1.28	0.93	1.35	2.68
matrix  0.35	1.43	0.35	1.38	1.90
fib     0.76	1.11	0.95	1.27	2.90

Here you see a factor 3.7-5.4 between the best result (gcc-2.95
dynamic) and the gcc-3.4.0 result for matrix.  This slowdown has two
main components:

- we cannot apply the dynamic optimizations (and I don't see a
portable way to work around this problem), resulting in a slowdown
factors of 1.5-4.

- from the shared indirect jump; the jump to this indirect jump costs
something, but the more important component of the slowdown is
probably the reduction in branch prediction accuracy for the indirect
jumps; the resulting slowdown is about a factor of 1.4-2.5.

On the Athlon I have seen similar results in the past.

Here are results from a Pentium 4:

Pentium 4 2.26GHz; numbers are times in seconds user time
	   gcc-2.95.3	   gcc-3.3	gcc-3.4.0
	dynamic no-dyn	dynamic	no-dyn	no-dyn
siev    0.24    0.48    0.31    0.47    0.50
bubble  0.30	0.78	0.36	0.77	0.78
matrix  0.19	0.94	0.17	0.92	0.96
fib     0.34	0.57	0.41	0.58	0.59

On the Pentium 4 the second component apparently does not play a big
role, probably due to the effects of the trace cache, but the first
component results in even higher slowdowns (1.7-5) than on the
PentiumIII.

Do flag variations change the problem?
--------------------------------------

Adding -fno-reorder-buffers moves .L1073 right below the first "jmp
.L1373" (as it should), but the jumps to .L1373 are still there.
Adding -fno-crossjumping in either case changes essentially nothing.
Reducing the command line to

gcc-3.4.0 -Wall -O2 -fomit-frame-pointer -S ef.i -o ef-3.4.0-nofs.s

produces similar results (but the register allocation is worse).

Suggested fix
-------------

AFAIK gcc introduces the shared indirect jump for the benefit of
data-flow analysis.  In gcc-3.3 with some combinations of flags this
change was undone later (but the pass that did this was suppressed
with -fno-reorder-jumps).  In 3.4.0 the undoing is apparently
completely disabled.

One fix would be to put the undoing into a pass that is invoked after
the data-flow analysis, but before the last combining pass, and that
is independent of flags such as -fno-reorder-blocks (maybe a separate
pass?).  An alternative would be to produce the desired control-flow
graph for data-flow analysis without changing the code (i.e., without
introducing shared jumps).

I might give writing one of these approaches pass a try, but would
prefer to see it done by people who know more about gcc internals than
I do.  In case I do it, I would apreciate feedback and suggestions on
how to proceed.

- anton

-- 
           Summary: pessimization of "goto *"
           Product: gcc
           Version: 3.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: anton at mips dot complang dot tuwien dot ac dot at
                CC: bernd dot paysan at gmx dot de,gcc-bugs at gcc dot gnu
                    dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242