Bug 25285 - pessimization of goto * ("computed goto")
Status: RESOLVED INVALID
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.0.0
Importance: P3 normal
Assigned To: Not yet assigned to anyone
Reported: 2005-12-06 19:56 UTC by anton
Modified: 2005-12-09 11:51 UTC

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:


Attachments
test case input (441.48 KB, text/plain)
2005-12-06 20:00 UTC, anton

Description anton 2005-12-06 19:56:37 UTC
This is the gcc-4.0 reincarnation of PR 15242.  Since about gcc-3.2,
gcc tends to compile "goto *" into direct jumps to a shared indirect
jump.  Gcc-4.0 tries to undo this in a later stage, but apparently it
is not completely successful:

This is a fragment from the engine1.i file I will
attach (with some newlines removed):

H_IALOAD: __asm__(""); I_IALOAD:
{
java_arrayheader * aArray;
s4 iIndex;
Cell vResult;
;
((aArray) = (java_arrayheader * )(sp[1]));
((iIndex)=(s4)(Cell)(spTOS));

sp += 1;
{
# 188 "./java.vmg"
{
  { if ((aArray) == ((void *)0)) { goto *throw_nullpointerexception; } };
  { if (( ((java_arrayheader*)(aArray))->size ) <= (u4) (iIndex)) { arrayindexoutofbounds_index = (iIndex); goto *throw_arrayindexoutofboundsexception; } };
  ;
  vResult = ((((java_intarray*)(aArray))->data)[iIndex]);
}
# 332 "java-vm.i"
}

;
((spTOS) = (Cell)(vResult));
J_IALOAD: __asm__("");
do {ca=*(ip++);} while(0);
K_IALOAD: __asm__("");
goto before_goto;
}

After compiling this with "gcc-4.0.0 -fno-reorder-blocks -O2 -g3 -S
engine1.i", the assembly output for this fragment is:

.L995:
	jmp	*%rdx
...
.L9:
.LBB46:
	.loc 115 314 0
	addq	$8, %r15
	.loc 115 315 0
	movl	-168(%rbp), %eax
	.loc 114 189 0
	movq	-136(%rbp), %rdx
	.loc 115 314 0
	movq	(%r15), %r9
	.loc 114 189 0
	testq	%r9, %r9
	je	.L995
	.loc 114 190 0
	cmpl	%eax, 16(%r9)
	ja	.L764
	movq	-152(%rbp), %rdx
	movl	%eax, -116(%rbp)
.LBE46:
	.loc 2 231 0
	jmp	*%rdx
.L764:
.LBB47:
	.loc 114 192 0
	cltq
	movslq	24(%r9,%rax,4),%r9
	movq	%r9, -168(%rbp)
.L195:
	.loc 115 342 0
	.loc 115 343 0
	movq	(%r14), %r13
	addq	$8, %r14
.L381:
	.loc 115 344 0
	jmp	.L560


So while gcc managed to reconstruct the second indirect jump, it did
not succeed for the first "goto *", which is pessimised into a
conditional branch to a shared indirect jump.

Code coming from "gcc version 4.0.2 (Debian 4.0.2-2)" or gcc-4.0.0
without -fno-reorder-blocks is similar.

The impact of this pessimisation is that we cannot use "selective
inlining" for JVM instructions that can throw exceptions, like
"getfield"; a rough guess at the resulting slowdown for the Cacao JVM
interpreter is a factor of 1.2.

Hmm, I guess that the intermediate direct unconditional jump is
optimised away, and that leads to the inability to reconstruct the
indirect jump.  Maybe I can work around this problem by putting an
__asm__("") or a label between the if and the "goto *" to prevent the
optimisation, but:

1) Will the workaround still work with the next gcc?

2) The workarounds start to accumulate.
Comment 1 anton 2005-12-06 20:00:18 UTC
Created attachment 10425 [details]
test case input
Comment 2 Andrew Pinski 2005-12-06 20:46:09 UTC
__asm__(""); makes stuff worse.

Anyway, it looks like the code in GCC only deals with unconditional jumps (unless I am missing something).
Is there a conditional form of indirect jump for x86/x86_64?
Comment 3 Andrew Pinski 2005-12-06 20:57:29 UTC
Since there are no conditional indirect jumps on x86, this is not a bug.
Comment 4 anton 2005-12-06 21:46:17 UTC
Subject: Re:  pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> __asm__(""); makes stuff worse.

I just applied the following patch to engine1.i:

--- engine1.i	2005-12-06 22:34:08.000000000 +0100
+++ engine1.i~	2005-12-06 19:04:12.000000000 +0100
@@ -17562,7 +17562,7 @@
 {
 # 188 "./java.vmg"
 {
-  { if ((aArray) == ((void *)0)) { __asm__(""); goto *throw_nullpointerexception; } };
+  { if ((aArray) == ((void *)0)) { goto *throw_nullpointerexception; } };
   { if (( ((java_arrayheader*)(aArray))->size ) <= (u4) (iIndex)) { arrayindexoutofbounds_index = (iIndex); goto *throw_arrayindexoutofboundsexception; } };
   ;
   vResult = ((((java_intarray*)(aArray))->data)[iIndex]);

The result looks much better; in particular, instead of the

        je      .L995

I now get:

	jne	.L762
	movq	-136(%rbp), %rdx
.LBE46:
	.loc 2 231 0
	jmp	*%rdx
.L762:

I.e., a separate indirect jump for improved branch prediction (not
important in this case, because this indirect branch will rarely be
used).  More importantly, there are no direct branches to far-away
places, which would destroy the relocatability (and thus the
suitability for "selective inlining") of this code fragment.

So, no, the code is not worse, but much better.  I hope this
workaround will continue to work in future versions.

- anton
Comment 5 Andrew Pinski 2005-12-06 21:58:03 UTC
(In reply to comment #4)
> So, no, the code is not worse, but much better.  I hope this
> workaround will continue to work in future versions.

You are wrong in general, since this is a conditional indirect jump.  Because it is conditional, a jump is going to be taken either way, and locality is not as important as it was in the old days, when code caches were small.  In fact, using jne instead of je might cause the branch to be mispredicted.  Note that if you were actually using a target which has conditional indirect jumps, this would be a bug (PPC, for example, can branch conditionally through either the lr or ctr register; see PR 25287 for a bug report about that).
Comment 6 anton 2005-12-08 21:31:30 UTC
Subject: Re:  pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> (In reply to comment #4)
> > So, no, the code is not worse, but much better.  I hope this
> > workaround will continue to work in future versions.
> 
> You are wrong in general, since this is a conditional indirect jump.  Because
> it is conditional, a jump is going to be taken either way, and locality is not
> as important as it was in the old days, when code caches were small.
> In fact, using jne instead of je might cause the branch to be mispredicted.

Sorry, you lost me here.  Conditional branch predictors on current
general-purpose CPUs are history-based, and I would not expect any
difference in the accuracy of the conditional branch prediction.
However, for BTB-based indirect branch predictors (Pentium 3, Athlon
64, and (modulo replication from the trace cache) Pentium 4), the
branch prediction accuracy suffers quite a lot if you combine several
well-predictable indirect branches with different targets into a
single indirect branch.

See [ertl&gregg03] for a deeper discussion, in particular Section 3.
You can also read in Section 5.2 (towards the end) why we don't want
to have a jump to far-away places.

> Note if you were actually using a target which have conditional indirect jumps
> this would be a bug (PPC for an example from either lr or ctr register, see PR
> 25287 for a bug report about that).

Sure, having a conditional indirect jump in-line would be nice.

But if the architecture does not have it (or if gcc does not utilize
it), what I would like to see in the resulting code is:

1) We compile with -fno-reorder-blocks, so the indirect branch should
be in the place corresponding to the source code, not somewhere else.

2) If you do reorder the blocks, you should not merge indirect
branches on CPUs with BTBs, for better branch prediction.

BTW, the __asm__("") workaround works nicely (for now), so I could
produce numbers for the slowdown for this bug, if you are interested.

@InProceedings{ertl&gregg03,
  author =       "M. Anton Ertl and David Gregg",
  title =        "Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters",
  crossref =     "sigplan03",
  OPTpages =	 "",
  url =		 "http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
  abstract =     "Interpreters designed for efficiency execute a huge
                  number of indirect branches and can spend more than
                  half of the execution time in indirect branch
                  mispredictions.  Branch target buffers are the best
                  widely available form of indirect branch prediction;
                  however, their prediction accuracy for existing
                  interpreters is only 2\%--50\%.  In this paper we
                  investigate two methods for improving the prediction
                  accuracy of BTBs for interpreters: replicating
                  virtual machine (VM) instructions and combining
                  sequences of VM instructions into superinstructions.
                  We investigate static (interpreter build-time) and
                  dynamic (interpreter run-time) variants of these
                  techniques and compare them and several combinations
                  of these techniques.  These techniques can eliminate
                  nearly all of the dispatch branch mispredictions,
                  and have other benefits, resulting in speedups by a
                  factor of up to 3.17 over efficient threaded-code
                  interpreters, and speedups by a factor of up to 1.3
                  over techniques relying on superinstructions alone."
}

- anton
Comment 7 Richard Biener 2005-12-09 11:51:54 UTC
> 2) If you do reorder the blocks, you should not merge indirect
> branches on CPUs with BTBs, for better branch prediction.

I would rather say that you should not merge frequently executed
indirect branches.  There is certainly an upper bound on the number of
indirect branches beyond which merging becomes profitable again, so
disabling merging only for frequently executed indirect branches (known
from profile feedback) while retaining merging of the seldom executed
ones is probably the best approach.
Comment 8 anton 2005-12-09 13:46:34 UTC
Subject: Re:  pessimization of goto * ("computed goto")

rguenth at gcc dot gnu dot org wrote:
> > 2) If you do reorder the blocks, you should not merge indirect
> > branches on CPUs with BTBs, for better branch prediction.
> 
> I would rather say that you should not merge frequently executed
> indirect branches.

Yes, that's a refinement.

> There is certainly an upper bound on the number of indirect branches
> beyond which merging becomes profitable again,

Yes, once the indirect branch gets evicted from the BTB between two
executions, you might just as well merge it with another indirect
branch of the same kind.  While not all rarely executed indirect
branches will have this property (the few executions might occur in a
burst), many of them will, and for the rest the number of executions
and thus the penalty of doing it wrong will be low.

OTOH, the potential speed benefit of this refinement is also low; so
the main benefit is probably in code size.

- anton