Bug 15242 - [3.4 regression] pessimization of "goto *"
Summary: [3.4 regression] pessimization of "goto *"
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 3.4.0
: P2 enhancement
Target Milestone: 4.0.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch
: 8092 (view as bug list)
Depends on: 15243 16585 20648
Blocks:
  Show dependency treegraph
 
Reported: 2004-05-01 14:26 UTC by anton
Modified: 2009-09-04 14:40 UTC (History)
8 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2005-10-07 03:18:56


Attachments
test input (file ef.i mentioned in the bug report) (46.15 KB, text/plain)
2004-05-01 14:30 UTC, anton
Details
Updated patch for this problem (2.24 KB, patch)
2005-01-27 08:33 UTC, Steven Bosscher
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description anton 2004-05-01 14:26:55 UTC
This is essentially the 3.4.0 incarnation of the problem discussed at
the end of PR8092.  mmitchel suggested to create a new bug report, so
here it is:

Code example and observed behaviour
-----------------------------------

In the attached code, the fragment between I_question_branch and
J_question_branch_lp_plus_store_number contains the following code
(slightly edited for readability):

I_question_branch:   
{
  Cell * a_target;
  Bool f;
  ;
  (( a_target )=(Cell *)( ( (* (ip) )   )  )) ;
  (( f )=(Bool)( (sp[0])  )) ;
  ({ ip+=( 1 );}) ;
  sp += 1;
  {
    if (f==0) {
      ({ip=( (Xt *)a_target );  ;}) ;
      ;
      (ip++) ;
      ;
      ({asm("":"=X"(cfa));  goto **(ip-1);}) ;
    }
    ;
  }
  ;
  (ip++) ;
  ;
K_question_branch : 
  ({asm("":"=X"(cfa));  goto **(ip-1);}) ;
}
J_question_branch_lp_plus_store_number :

When the file is compiled with

gcc-3.4.0  -I../../gforth-0.6.2/engine/../arch/386 -I. -Wall -O2
-fomit-frame-pointer -fforce-addr -fforce-mem -march=pentium -DHAVE_CONFIG_H
-DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/share/gforth/site-forth:/usr/local/lib/gforth/0.6.2:/usr/local/share/gforth/0.6.2"'
-fno-gcse -fno-strict-aliasing -fno-defer-pop -fcaller-saves -fno-inline -S ef.i
-o ef-3.4.0-def.s

this produces the following code:

.L20:
	movl	(%edi), %eax
	movl	(%ebp), %edx
	addl	$4, %edi
	addl	$4, %ebp
	testl	%eax, %eax
	jne	.L1073
	leal	4(%edx), %ebp
	movl	-4(%ebp), %eax
	jmp	.L1373
.L739:
... #actually the following fragment comes earlier
.L1073:
	addl	$4, %ebp
.L376:
	movl	-4(%ebp), %eax
	jmp	.L1373
... #the following fragment comes even earlier.
.L1373: 
	jmp	*%eax


What is the problem?
--------------------

The main problem here is that the code jumps to a shared "jmp *eax"
instead of just putting "jmp *eax" (or, in this case, preferably "jmp
*-4(%ebp)") where the jump to the shared indirect jump is.  All
indirect jumps are converted into jumps to a single shared indirect
jump in this way.

This obstructs effective usage of the BTB for branch prediction on
CPUs such as the Athlon, Opteron, and Pentium III (see below), and
most interpreters using GCC's labels-as-values extension will be
affected by this.

It also makes it impossible for us to use dynamic superinstructions
portably (aka selective inlining, PLDI'98 p. 291).  I could eventually
find ways to work around the problems we have had with 3.2 and 3.3,
but this time I don't see a workaround (and a quick look for new
options revealed nothing).

A smaller problem is that the basic blocks are distributed throughout
the code unless we use -fno-reorder-blocks; this reduces the number of
cases where we can apply dynamic superinstructions.  For gcc-3.3,
using -fno-crossjumping suppressed the main problem, but not in
combination with -fno-reorder-blocks.  So if you fix the main problem,
it would be nice if it also worked with -fno-reorder-blocks.


How much does this cost in performance?
---------------------------------------

Here are the results of doing a

configure --enable-force-reg CC=...; make; make onebench
ENGINE_FAST="./gforth-fast --dynamic"; make onebench ENGINE_FAST="./gforth-fast
--no-dynamic"

for gforth-0.6.2 with various compilers (The ef.i file used for the
code example above was also created from gforth-0.6.2, with gcc-2.95
and without --enable-force-reg):

Pentium III 1000MHz; numbers are times in seconds user time
	   gcc-2.95.3	   gcc-3.3	gcc-3.4.0
	dynamic no-dyn	dynamic	no-dyn	no-dyn
siev    0.56    0.85    0.67    0.90    2.31
bubble  0.72	1.28	0.93	1.35	2.68
matrix  0.35	1.43	0.35	1.38	1.90
fib     0.76	1.11	0.95	1.27	2.90

Here you see a factor 3.7-5.4 between the best result (gcc-2.95
dynamic) and the gcc-3.4.0 result for matrix.  This slowdown has two
main components:

- we cannot apply the dynamic optimizations (and I don't see a
portable way to work around this problem), resulting in a slowdown
factors of 1.5-4.

- from the shared indirect jump; the jump to this indirect jump costs
something, but the more important component of the slowdown is
probably the reduction in branch prediction accuracy for the indirect
jumps; the resulting slowdown is about a factor of 1.4-2.5.

On the Athlon I have seen similar results in the past.

Here are results from a Pentium 4:

Pentium 4 2.26GHz; numbers are times in seconds user time
	   gcc-2.95.3	   gcc-3.3	gcc-3.4.0
	dynamic no-dyn	dynamic	no-dyn	no-dyn
siev    0.24    0.48    0.31    0.47    0.50
bubble  0.30	0.78	0.36	0.77	0.78
matrix  0.19	0.94	0.17	0.92	0.96
fib     0.34	0.57	0.41	0.58	0.59

On the Pentium 4 the second component apparently does not play a big
role, probably due to the effects of the trace cache, but the first
component results in even higher slowdowns (1.7-5) than on the
PentiumIII.


Do flag variations change the problem?
--------------------------------------

Adding -fno-reorder-buffers moves .L1073 right below the first "jmp
.L1373" (as it should), but the jumps to .L1373 are still there.
Adding -fno-crossjumping in either case changes essentially nothing.
Reducing the command line to

gcc-3.4.0 -Wall -O2 -fomit-frame-pointer -S ef.i -o ef-3.4.0-nofs.s

produces similar results (but the register allocation is worse).


Suggested fix
-------------

AFAIK gcc introduces the shared indirect jump for the benefit of
data-flow analysis.  In gcc-3.3 with some combinations of flags this
change was undone later (but the pass that did this was suppressed
with -fno-reorder-jumps).  In 3.4.0 the undoing is apparently
completely disabled.

One fix would be to put the undoing into a pass that is invoked after
the data-flow analysis, but before the last combining pass, and that
is independent of flags such as -fno-reorder-blocks (maybe a separate
pass?).  An alternative would be to produce the desired control-flow
graph for data-flow analysis without changing the code (i.e., without
introducing shared jumps).

I might give writing one of these approaches pass a try, but would
prefer to see it done by people who know more about gcc internals than
I do.  In case I do it, I would apreciate feedback and suggestions on
how to proceed.

- anton
Comment 1 anton 2004-05-01 14:30:56 UTC
Created attachment 6206 [details]
test input (file ef.i mentioned in the bug report)
Comment 2 Andrew Pinski 2004-05-01 14:38:58 UTC
Note jumping out of a statement expression is really illegal and should not be done.
Comment 3 Andrew Pinski 2004-05-01 14:42:23 UTC
One more thing statement expressions should end with a non-void expression.
Comment 4 anton 2004-05-01 15:03:54 UTC
Subject: Re:  pessimization of "goto *"

pinskia at gcc dot gnu dot org wrote:
> Note jumping out of a statement expression is really illegal and should not be done.

Statement expressions are used here just to produce a way to include a
statement sequence in a macro.  Do you recommend that we should switch
from 

({ ... })

to

do { ... } while (0)

?

Or do you have a different suggestion?

BTW, looking in the gcc-3.4.0 manual, I don't see the restrictions you
mention, and gcc-3.4.0 -Wall also does not complain about this usage.
Actually the manual explicitly says:

|(If you use some other kind of statement last within the braces, the
|construct has type `void', and thus effectively no value.)

which I would interpret as "Anything goes, as long as you don't use
the value of the statement expression".

- anton
Comment 5 Andrew Pinski 2004-05-01 15:13:12 UTC
The reason why I say this is because of on the tree-ssa, it does warn about using about void statements 
as the end, and on the tree-ssa right now ICEs on the tree-ssa, see PR 15243.  Also this is not 
documented there is another bug for that, PR 772.
Comment 6 anton 2004-05-02 07:24:28 UTC
Subject: Re:  pessimization of "goto *"

pinskia at gcc dot gnu dot org wrote:
> 
> 
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-05-01 15:13 -------
> The reason why I say this is because of on the tree-ssa, it does warn about using about void statements 
> as the end, and on the tree-ssa right now ICEs on the tree-ssa, see PR 15243.  Also this is not 
> documented there is another bug for that, PR 772.

And that refers to a discussion in the mailing list, and the only
mention of a problem with jumping out that I found there was in
<http://gcc.gnu.org/ml/gcc/2000-07/msg00721.html>.  For this example
the semantics seem pretty obvious and well-defined to me (in contrast
to the jump-into example), and this was just a compiler bug.

If you find this bug hard to fix, it's ok with me if you document that
jumping out does not work and give a warning (and eventually an error)
on compiling it, and I will work around this restriction.

If you want, I can also prepare a version of the test file that uses
do...while(0) instead of statement expressions in some places.

- anton
Comment 7 Joseph S. Myers 2004-05-02 19:06:34 UTC
Subject: Re:  pessimization of "goto *"

On Sun, 2 May 2004, anton at a0 dot complang dot tuwien dot ac dot at wrote:

> And that refers to a discussion in the mailing list, and the only
> mention of a problem with jumping out that I found there was in
> <http://gcc.gnu.org/ml/gcc/2000-07/msg00721.html>.  For this example
> the semantics seem pretty obvious and well-defined to me (in contrast
> to the jump-into example), and this was just a compiler bug.

The semantics are neither obvious nor well-defined, because other parts of
the larger expression containing the statement expression might or might
not have been evaluated.  The only thing in C giving rise to such a
situation (where parts of an expression might or might not have been
evaluated when a jump occurs) is setjmp/longjmp, and you should be able to
use longjmp in statement expressions; goto can't have such effects in C so
it's dubious allowing them in GNU C.

Comment 8 anton 2004-05-04 15:06:27 UTC
Subject: Re:  pessimization of "goto *"

jsm at polyomino dot org dot uk wrote:
> 
> 
> ------- Additional Comments From jsm at polyomino dot org dot uk  2004-05-02 19:06 -------
> Subject: Re:  pessimization of "goto *"
> 
> On Sun, 2 May 2004, anton at a0 dot complang dot tuwien dot ac dot at wrote:
> 
> > And that refers to a discussion in the mailing list, and the only
> > mention of a problem with jumping out that I found there was in
> > <http://gcc.gnu.org/ml/gcc/2000-07/msg00721.html>.  For this example
> > the semantics seem pretty obvious and well-defined to me (in contrast
> > to the jump-into example), and this was just a compiler bug.
> 
> The semantics are neither obvious nor well-defined, because other parts of
> the larger expression containing the statement expression might or might
> not have been evaluated.

The example is:

  a:
     f ("%d\n", ({goto a;})

The only other part that might or might not have been evaluated is the
"%d\n"; since this has no side effect, it does not matter whether it
is evaluated or not.  So the obvious (to me) and well-defined
semantics of this example is to be equivalent to

a: goto a;

In general, true, if the other parts of the expression have side
effects, the whole expression is undefined, but that's just the same
as in other cases with side effects in expression evaluation, so the
semantics still seem obvious to me.

>  The only thing in C giving rise to such a
> situation (where parts of an expression might or might not have been
> evaluated when a jump occurs) is setjmp/longjmp, and you should be able to
> use longjmp in statement expressions; goto can't have such effects in C so
> it's dubious allowing them in GNU C.

Variable definitions also give rise to such situations (where parts of
an expression might or might not have been evaluated when the
definition occurs), and this may also lead to undefined behaviour, and
this cannot happen in C.  Does this make allowing variable definitions
in statement expressions dubious in GNU C?

I don't think so (after all, they are used in the first example for
statement expressions), and I don't think that this kind of argument
is valid for other things that cannot occur in C expressions, either.

If only things were allowed in statement expressions that are allowed
in C expressions, then statement expressions would be redundant; you
could simply write "(expr, expr, expr)" instead of "({expr; expr;
expr;})".

- anton


Comment 9 Andrew Pinski 2004-05-06 04:45:33 UTC
Confirmed a reduced testcase:
int f(int i, int j)
{
  static void *a[5] = {&&b,&&c,&&d,&&e,&&f};

    if (i >=5) i = 0; if (i<=0)i=5;
    goto *a[i];
  e:
    i++;  if (i >=5) i = 0; if (i<=0)i=5;
    j++;
    goto *a[i];
  f:
    i--;  if (i >=5) i = 0; if (i<=0)i=5;
    j--;
    goto *a[i];
  b:
    i+=2;  if (i >=5) i = 0; if (i<=0)i=5;
    goto *a[i];
  c:
    i*=2;  if (i >=5) i = 0; if (i<=0)i=5;
    goto *a[i];
  d:;
}
Comment 10 Josef Zlomek 2004-05-24 14:03:34 UTC
Proposed patch:
http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01526.html
Comment 11 Andrew Pinski 2004-05-26 12:29:05 UTC
Newest Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01670.html>.
Comment 12 anton 2004-07-06 20:35:29 UTC
Subject: Re:  pessimization of "goto *"

pinskia at gcc dot gnu dot org wrote:
> Newest Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01670.html>.

I tried this patch with gcc-3.5-20040523, and the problem was still
there, for both the reduced test case, and a variant (work around PR
15243) of the original testcase.

I also tried this patch with gcc-core-3.5-20040704, and it applies
cleanly, but then produces a compilation error:

../../gcc-3.5-20040704/gcc/bb-reorder.c: In function `duplicate_computed_gotos':
../../gcc-3.5-20040704/gcc/bb-reorder.c:1989: error: structure has no member named `computed_goto_common_label'
make[2]: *** [bb-reorder.o] Error 1
make[2]: Leaving directory `/nfs/nfstmp/anton/gcc-patched/gcc-3.5-20040704-i386/gcc'

- anton
Comment 13 Andrew Pinski 2004-07-06 22:00:15 UTC
It looks like you can remove:
+   if (!cfun->computed_goto_common_label)
+     return;

from the patch.
Comment 14 zlomj9am 2004-07-07 04:49:00 UTC
Subject: Re:  pessimization of "goto *"

> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-07-06 22:00 -------
> It looks like you can remove:
> +   if (!cfun->computed_goto_common_label)
> +     return;
> 
> from the patch.

cfun->computed_goto_common_label can be replaced by cfun->has_computed_jump.
The test is there not to scan the function if the computed goto is not
used in the function.

Josef
Comment 15 Josef Zlomek 2004-07-07 13:34:20 UTC
> I tried this patch with gcc-3.5-20040523, and the problem was still
there

The optimization is enabled at -O2 (or -O -freorder-blocks).
If there is a uncond jump to a block containing a computed jump and the block is
small enough (several instructions) the block is copied to eliminate the uncond
jump.

When I was testing the patch with current mainline I have noticed that the
current_function_has_computed_jump flag is incorrectly unset in make_edges()
when compiling ef.i. IMHO the flag should not be set initialized to 0 (or 1 if
flag_reorder_blocks_and_partition) in make_edges() but somewhere earlier because
make_edges() can be called several times and only a subset of blocks may be
scanned for computed jumps in it.
Comment 16 anton 2004-07-08 13:03:21 UTC
Subject: Re:  pessimization of "goto *"

zlomek at gcc dot gnu dot org wrote:
> The optimization is enabled at -O2

Not as far as I can tell (tried it on i386 with patched versions of
gcc-3.5-20040704 and gcc-3.5-20040523).

> (or -O -freorder-blocks).

That seems to work, though.

We would like to use -fno-reorder-blocks with
duplicate_computed_gotos, either through an extra option, or by just
having duplicate_computed_gotos all the time (at least for our code
these gotos were duplicated already in the source code, so duplicating
them again in the back end is closer to the idea of
-fno-reorder-blocks than not duplicating them).

> If there is a uncond jump to a block containing a computed jump and the block is
> small enough (several instructions) the block is copied to eliminate the uncond
> jump.
> 
> When I was testing the patch with current mainline I have noticed that the
> current_function_has_computed_jump flag is incorrectly unset in make_edges()
> when compiling ef.i.

Ok, that explains why this patch does not work for ef.i.

One other thing I noticed is that apparently there is no combining
pass (or its tree-ssa equivalent) after duplicate_computed_gotos,
resulting in compiling code like

    goto *a[i];

into

        movl    a.0(,%ebx,4), %eax
        jmp     *%eax

instead of

	jmp	*a.0(,%ebx,4)

Thanks for your efforts,

- anton
Comment 17 zlomj9am 2004-07-16 07:01:31 UTC
Subject: Re:  pessimization of "goto *"

> > The optimization is enabled at -O2
> Not as far as I can tell (tried it on i386 with patched versions of
> gcc-3.5-20040704 and gcc-3.5-20040523).

Yes, it is enabled by the patch. The patch duplicates block with computed goto
to place where a uncond jump to the conputed goto block is if the block is
small enough.
The computed jump in the small testcase is not duplicated, because the uncond
jumps from other places do not jump to a basic block which contains the
computed jump but to other block:

f:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        movl    8(%ebp), %ebx
        .p2align 2,,3
.L37:
        cmpl    $4, %ebx
        jle     .L39
.L34:
        movl    $5, %ebx
.L9:
        movl    a.1118(,%ebx,4), %eax
        jmp     *%eax
        .p2align 2,,3
.L39:
        testl   %ebx, %ebx
        jg      .L9
        jmp     .L34
        .p2align 2,,3
.L3:
        popl    %ebx
        leave
        ret
        .p2align 2,,3
.L2:
        sall    %ebx
        jmp     .L37
        .p2align 2,,3
.L1:
        addl    $2, %ebx
        jmp     .L37
        .p2align 2,,3
.L5:
        decl    %ebx
        jmp     .L37
        .p2align 2,,3
.L4:
        incl    %ebx
        jmp     .L37

However, I see that after the L39 there is a cond jump to the computed jump block.
This is not handled by my patch but could be.

> We would like to use -fno-reorder-blocks with
> duplicate_computed_gotos, either through an extra option, or by just
> having duplicate_computed_gotos all the time (at least for our code
> these gotos were duplicated already in the source code, so duplicating
> them again in the back end is closer to the idea of
> -fno-reorder-blocks than not duplicating them).

Adding an new flag is possible. However, I do not see why you explicitelly want
to disable reorder-blocks.

> One other thing I noticed is that apparently there is no combining
> pass (or its tree-ssa equivalent) after duplicate_computed_gotos,
> resulting in compiling code like
> 
>     goto *a[i];
> 
> into
> 
>         movl    a.0(,%ebx,4), %eax
>         jmp     *%eax
> 
> instead of
> 
> 	jmp	*a.0(,%ebx,4)

The combiner should have already combined these insns before the duplication of
computed gotos is run. This pass just duplicates some insns.

Josef
Comment 18 anton 2004-07-16 08:17:36 UTC
Subject: Re:  pessimization of "goto *"

zlomj9am at artax dot karlin dot mff dot cuni dot cz wrote:
> > > The optimization is enabled at -O2
> > Not as far as I can tell (tried it on i386 with patched versions of
> > gcc-3.5-20040704 and gcc-3.5-20040523).
> 
> Yes, it is enabled by the patch. The patch duplicates block with computed goto
> to place where a uncond jump to the conputed goto block is if the block is
> small enough.
> The computed jump in the small testcase is not duplicated, because the uncond
> jumps from other places do not jump to a basic block which contains the
> computed jump but to other block:
...

Thanks for opening my eyes.

> > We would like to use -fno-reorder-blocks with
> > duplicate_computed_gotos, either through an extra option, or by just
> > having duplicate_computed_gotos all the time (at least for our code
> > these gotos were duplicated already in the source code, so duplicating
> > them again in the back end is closer to the idea of
> > -fno-reorder-blocks than not duplicating them).
> 
> Adding an new flag is possible. However, I do not see why you explicitelly want
> to disable reorder-blocks.

We need it for the dynamic superinstruction optimization (aka
selective inlining, PLDI'98 p. 291) in Gforth and further
optimizations building on that (other people working on efficient
interpreters/dynamic code generators have the same problem; e.g., I
recently talked to Etienne Gagnon (http://sablevm.org/) about this;
qemu also uses -fno-reorder-blocks).

For dynamic superinstructions we need to ensure that the machine code
corresponding to source code between two labels is between the
addresses that gcc produces for the labels.  The flag
-fno-reorder-blocks almost gives us that (but the sharing of the
"goto *"s breaks that).

Even for just plain interpreters (without dynamic superinstructions),
the sharing of the "goto *"s resulting from the merging of code after
conditional jumps (as shown in the simplified testcase) will often
reduce performance significantly through lower BTB accuracy.  So for
performance one probably wants to use -fno-reorder-blocks there,
unless this results in merging the "goto *"s, as it does now.

> > One other thing I noticed is that apparently there is no combining
> > pass (or its tree-ssa equivalent) after duplicate_computed_gotos,
> > resulting in compiling code like
> > 
> >     goto *a[i];
> > 
> > into
> > 
> >         movl    a.0(,%ebx,4), %eax
> >         jmp     *%eax
> > 
> > instead of
> > 
> > 	jmp	*a.0(,%ebx,4)
> 
> The combiner should have already combined these insns before the duplication of
> computed gotos is run. This pass just duplicates some insns.

My guess is that the combiner does not combine these insns before
duplication, because they are in different basic blocks before
duplication.  Without duplication the code would look like:

         movl    a.0(,%ebx,4), %eax
         jmp     .L9
         ...
.L9:
         jmp     *%eax

- anton
Comment 19 zlomj9am 2004-07-16 08:42:24 UTC
Subject: Re:  pessimization of "goto *"

> > Adding an new flag is possible. However, I do not see why you explicitelly want
> > to disable reorder-blocks.
> 
> We need it for the dynamic superinstruction optimization (aka
> selective inlining, PLDI'98 p. 291) in Gforth and further
> optimizations building on that (other people working on efficient
> interpreters/dynamic code generators have the same problem; e.g., I
> recently talked to Etienne Gagnon (http://sablevm.org/) about this;
> qemu also uses -fno-reorder-blocks).
> 
> For dynamic superinstructions we need to ensure that the machine code
> corresponding to source code between two labels is between the
> addresses that gcc produces for the labels.  The flag
> -fno-reorder-blocks almost gives us that (but the sharing of the
> "goto *"s breaks that).

I see.

> Even for just plain interpreters (without dynamic superinstructions),
> the sharing of the "goto *"s resulting from the merging of code after
> conditional jumps (as shown in the simplified testcase) will often
> reduce performance significantly through lower BTB accuracy.  So for
> performance one probably wants to use -fno-reorder-blocks there,
> unless this results in merging the "goto *"s, as it does now.

I see. So duplication of computed jumps probably should be at -O
and have a special flag too.

Josef
Comment 20 bernd.paysan 2004-10-03 20:25:43 UTC
On Wednesday 07 July 2004 06:49, zlomj9am at artax dot karlin dot mff dot cuni  
dot cz wrote: 
> cfun->computed_goto_common_label can be replaced by 
> cfun->has_computed_jump. The test is there not to scan the function if the 
> computed goto is not used in the function. 
 
Since your patch is still not in the gcc 4.0.0-xxx codebase, I've modified it  
to work with the current CVS version. There's still the problem that the jump  
does not combine with the previous instruction, and that therefore one  
valuable (especially on x86!) register is used up. As trigger condition for  
goto duplication I used the -fno-crossjumping switch, together with  
optimization. 
 
I've put the patch online under 
 
http://www.jwdt.com/~paysan/gcc-4.0.0.patch 
Comment 21 Giovanni Bajo 2004-10-04 12:21:22 UTC
Bernd, the patch needs to be posted to gcc-patches@ after a full bootstrap & 
test cycle to gather attention. Probably Josef should take care of this (being 
the assigneed of this bug).
Comment 22 Steven Bosscher 2004-10-25 21:10:21 UTC
I have a patch that fixes this.  I need to figure out a way to do it in a way 
that doesn't require you to pass "-fno-crossjumping -fno-gcse" but I'll figure 
out something... 
 
 
Comment 23 Steven Bosscher 2004-12-18 16:46:51 UTC
Comes from factoring computed gotos, so this is a code quality regression. 
 
 
 
 
Comment 24 Steven Bosscher 2005-01-27 08:33:06 UTC
Created attachment 8079 [details]
Updated patch for this problem
Comment 25 Andrew Pinski 2005-01-27 13:44:54 UTC
Patch here: <http://gcc.gnu.org/ml/gcc-patches/2005-01/msg02001.html>.
Comment 26 CVS Commits 2005-02-01 10:03:57 UTC
Subject: Bug 15242

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	steven@gcc.gnu.org	2005-02-01 10:03:27

Modified files:
	gcc            : ChangeLog basic-block.h bb-reorder.c 
	                 cfgcleanup.c params.def passes.c 
	gcc/doc        : invoke.texi 

Log message:
	PR optimization/15242
	* params.def (PARAM_MAX_GOTO_DUPLICATION_INSNS): New param.
	* basic-block.h (duplicate_computed_gotos): Add prototype.
	* bb-reorder.c (duplicate_computed_gotos): New function to
	duplicate sufficiently small blocks ending in a computed jump.
	* passes.c (rest_of_compilation): Call duplicate_computed_gotos
	if not optimizing for size.
	* cfgcleanup.c (try_crossjump_bb): If not optimizing for size,
	never do tail merging for blocks ending in a computed jump.
	* doc/invoke.texi: Document the max-goto-duplication-insns param.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7359&r2=2.7360
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/basic-block.h.diff?cvsroot=gcc&r1=1.236&r2=1.237
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/bb-reorder.c.diff?cvsroot=gcc&r1=1.88&r2=1.89
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/cfgcleanup.c.diff?cvsroot=gcc&r1=1.138&r2=1.139
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/params.def.diff?cvsroot=gcc&r1=1.53&r2=1.54
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/passes.c.diff?cvsroot=gcc&r1=2.64&r2=2.65
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/doc/invoke.texi.diff?cvsroot=gcc&r1=1.571&r2=1.572

Comment 27 Steven Bosscher 2005-02-01 10:11:28 UTC
Should be fixed on mainline now. 
Comment 28 Kazu Hirata 2005-02-07 05:28:52 UTC
Is this PR fixed?  If so, please close it.
Comment 29 Andrew Pinski 2005-02-07 05:32:48 UTC
(In reply to comment #28)
> Is this PR fixed?  If so, please close it.

it is fixed on the mainline but still not fixed on the 3.4 branch as this is a regression we should keep it 
open until The RM says otherwise.
Comment 30 anton 2005-02-27 10:47:21 UTC
Subject: Re:  [3.3/3.4/4.0 regression] pessimization of "goto *"

steven at gcc dot gnu dot org wrote:
> Updated patch for this problem

Ok, I have now tried it with gcc-4.0-20050220 and gforth-0.6.2 on i386
and it works ok.  There is a minor remaining regression related to this
issue:

gcc-4.0 compiles "goto **(ip-1);" into

0x804b50d <engine+1341>:        mov    0xfffffffc(%ebx),%eax
0x804b510 <engine+1344>:        jmp    *%eax

gcc-2.95 compiles it into

0x804b103 <engine+227>: jmp    *0xfffffffc(%ebx)

which looks better.

Maybe there should be another combining pass after the duplication of the indirect jumps.  Should I create another PR for this?

Here are updated timing results (gforth-0.6.2 configured with
--enable-force-reg):

Pentium 4 2.26GHz; numbers are times in seconds user time
                                                  gcc-4.0-20050220 
	   gcc-2.95.3	   gcc-3.3	gcc-3.4.0 default no-reorder-blocks
	dynamic no-dyn	dynamic	no-dyn	no-dyn	  dynamic dynamic
siev    0.24    0.48    0.31    0.47    0.50      0.37    0.28
bubble  0.30	0.78	0.36	0.77	0.78	  0.40	  0.36
matrix  0.19	0.94	0.17	0.92	0.96	  0.18	  0.18
fib     0.34	0.57	0.41	0.58	0.59	  0.43	  0.40

- anton


Comment 31 Steven Bosscher 2005-03-10 12:48:03 UTC
> Maybe there should be another combining pass after the duplication
> of the indirect jumps.  Should I create another PR for this?

There should not be another "combining" pass (you really mean constant
propagation).  This new unfactoring stuff runs after register allocation,
so such a pass would not really help, except maybe to make the code look
prettier to you.

But, is this:

        mov    0xfffffffc(%ebx),%eax
        jmp    *%eax

slower than this:

        jmp    *0xfffffffc(%ebx)

or have you not tried that (e.g. by hacking the assembly by hand)?
Comment 32 anton 2005-03-12 21:38:44 UTC
Subject: Re:  [3.3/3.4 regression] pessimization of "goto *"

steven at gcc dot gnu dot org wrote:
> 
> 
> ------- Additional Comments From steven at gcc dot gnu dot org  2005-03-10 12:48 -------
> > Maybe there should be another combining pass after the duplication
> > of the indirect jumps.  Should I create another PR for this?
> 
> There should not be another "combining" pass (you really mean constant
> propagation).

I meant "Instruction combination (`combine.c')".  Not sure if this is
replaced by something else in the recent gccs.  Why do you think I
mean constant propagation?

>  This new unfactoring stuff runs after register allocation,
> so such a pass would not really help, except maybe to make the code look
> prettier to you.

Ouch.  No way to fix that?  That's the cost we wanted to avoid.

> But, is this:
> 
>         mov    0xfffffffc(%ebx),%eax
>         jmp    *%eax
> 
> slower than this:
> 
>         jmp    *0xfffffffc(%ebx)
> 
> or have you not tried that (e.g. by hacking the assembly by hand)?

Ok, I hacked the assembly by hand, and this is what I got:

All numbers are user times in seconds for gforth-fast-0.6.2:

Pentium-4 2.26 GHz (i386 code):
           no-dynamic  no-super   dynamic    
combined?  yes  no     yes  no	  yes  no    
siev       0.47 0.49   0.36 0.36  0.33 0.33  
bubble     0.81 0.81   0.52 0.53  0.47 0.47  
matrix     1.03 1.01   0.30 0.30  0.36 0.35  
fib        0.70 0.68   0.75 0.60  0.53 0.58  

Opteron 2GHz (i386 code):
           no-dynamic  no-super   dynamic    
combined?  yes  no     yes  no	  yes  no    
siev       0.46 0.47   0.37 0.36  0.33 0.32
bubble     0.73	0.74   0.50 0.51  0.50 0.51
matrix     0.93	0.95   0.35 0.34  0.31 0.32
fib        0.63	0.64   0.49 0.50  0.44 0.45

"No-super" performs the same number of indirect branches (and anything
else) as "no-dynamic", but has better branch prediction.  "Dynamic" is
like "no-super", but eliminates many of the indirect branches.

So, overall the instruction combination alone does not make much of a
difference on these CPUs.

- anton

Comment 33 Steven Bosscher 2005-03-12 21:54:08 UTC
Subject: Re:  [3.3/3.4 regression] pessimization of "goto *"

Combine runs before register allocation.  You cannot run it after
register allocation.  I don't think you can run it twice, even.

And no, after register allocation you are not magically going to
win back that register.  Tough luck.

Given your numbers, replacing the reg with its known value might
still be a win.  But I'm not going to do it ;-)
Comment 34 Steven Bosscher 2005-10-07 14:09:24 UTC
This is basically unfixable without serious hacking that is not appropriate for GCC 3.4, so Gaby, with your permission, I'd like to close this one as WONTFIX...
Comment 35 Gabriel Dos Reis 2005-10-07 20:00:31 UTC
Subject: Re:  [3.4 regression] pessimization of "goto *"

"steven at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org> writes:

| This is basically unfixable without serious hacking that is not
| appropriate for GCC 3.4, so Gaby, with your permission, I'd like to
| close this one as WONTFIX...

I agree with your analysis.  

Thanks,

-- Gaby
Comment 36 Steven Bosscher 2005-10-07 20:06:06 UTC
Not really fixable for 3.4.
Comment 37 Andrew Pinski 2006-04-26 19:13:44 UTC
Reopening to close as ...
Comment 38 Andrew Pinski 2006-04-26 19:14:16 UTC
Fixed in 4.0.0.
Comment 39 Andrew Pinski 2006-04-26 19:14:55 UTC
*** Bug 8092 has been marked as a duplicate of this bug. ***